Ryan Stalets created ARROW-13318:
------------------------------------

             Summary: kMaxParserNumRows Value Increase/Removal
                 Key: ARROW-13318
                 URL: https://issues.apache.org/jira/browse/ARROW-13318
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Ryan Stalets


I'm a new PyArrow user and have been investigating occasional errors with the 
Python exception "ArrowInvalid: Exceeded maximum rows" when parsing JSON Lines 
files using pyarrow.json.read_json(). Digging in, it looks like the original 
source of this exception is in cpp/src/arrow/json/parser.cc on line 703, which 
appears to return the error once the number of rows parsed reaches 
kMaxParserNumRows.

 
{code:java}
for (; num_rows_ < kMaxParserNumRows; ++num_rows_) {
  auto ok = reader.Parse<parse_flags>(json, handler);
  switch (ok.Code()) {
    case rj::kParseErrorNone:
      // parse the next object
      continue;
    case rj::kParseErrorDocumentEmpty:
      // parsed all objects, finish
      return Status::OK();
    case rj::kParseErrorTermination:
      // handler emitted an error
      return handler.Error();
    default:
      // rj emitted an error
      return ParseError(rj::GetParseError_En(ok.Code()), " in row ", num_rows_);
  }
}
return Status::Invalid("Exceeded maximum rows");
{code}
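
To make the control flow of that loop easier to follow, here is a minimal Python model of it. This is a sketch, not Arrow's code: parse_rows and K_MAX_PARSER_NUM_ROWS are hypothetical names, the limit is shrunk to 4 for illustration, and the standard-library json decoder stands in for RapidJSON. One thing the model suggests is that an input with exactly the maximum number of rows would also hit the error, because the loop ends before it can observe the end of the input.

```python
import json

# Stand-in for kMaxParserNumRows (100000 in parser.h); tiny here for illustration.
K_MAX_PARSER_NUM_ROWS = 4


def parse_rows(text, max_rows=K_MAX_PARSER_NUM_ROWS):
    """Model of the quoted loop: parse one JSON value per iteration."""
    decoder = json.JSONDecoder()
    rows = []
    pos = 0
    for _ in range(max_rows):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            # Analogue of kParseErrorDocumentEmpty: everything was parsed.
            return rows
        obj, pos = decoder.raw_decode(text, pos)
        rows.append(obj)
    # The loop ran max_rows times without reaching the end of the input.
    raise ValueError("Exceeded maximum rows")
```

With max_rows=4, three rows parse cleanly, while four or more rows raise the error.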
 

 

This constant appears to be set in cpp/src/arrow/json/parser.h on line 53, and 
has had this value since that file's initial commit.

 
{code:java}
constexpr int32_t kMaxParserNumRows = 100000;{code}
 

 

Neither the code, the commit, nor the PR appears to contain a comment 
explaining why the number of rows is capped at this value.

 

I'm wondering what the reason for this maximum might be, and whether it could 
be removed, increased, or made overridable in the C++ library and in the Python 
bindings built on it. It is common to need to process JSON files of arbitrary 
length (logs from applications, third-party vendors, etc.) where the user of 
the data does not have control over the size of the file.
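
In the meantime, one possible workaround is to split the input into pieces that each stay under the limit. The sketch below assumes one JSON record per line and assumes the limit applies to each parse call separately; neither is confirmed above. iter_line_chunks is a hypothetical helper, not part of PyArrow.

```python
import itertools


def iter_line_chunks(lines, max_rows):
    """Yield byte chunks holding at most max_rows non-empty lines each.

    `lines` is any iterable of bytes lines, e.g. a file opened in "rb" mode.
    """
    nonblank = (ln for ln in lines if ln.strip())
    while True:
        batch = list(itertools.islice(nonblank, max_rows))
        if not batch:
            return
        # Make sure every record ends with a newline before joining.
        yield b"".join(ln if ln.endswith(b"\n") else ln + b"\n" for ln in batch)
```

Each chunk could then be wrapped in io.BytesIO and passed to pyarrow.json.read_json(), and the resulting tables combined with pyarrow.concat_tables(). Whether this actually avoids the exception depends on the assumptions above.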



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
