saulpw commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171415516

   > In which datasets? Unless there are hundreds/thousands of fields per 
object, it doesn't sound that likely.
   
   Per the original issue ARROW-9612:
   
   > [abhishek bharani] I got this issue when json record was on single line 
with col length 2097323. 
   
   > [Pere-Lluís Huguet Cabot] I have added a file that prompts the same error. 
It is a dump of wikipedia abstracts with wikidata information.
   
   So we don't have any "real" data on the distribution of record sizes, but we
now have 3 people who encountered this error and felt enough friction to post
an issue on JIRA (or, in my case, submit an actual PR). Given what I know about
how few people engage with OSS versus just suck it up and feel a bit worse
about the project, this is significant enough for me to say "let's bump the
default ~10x" as an immediate workaround. One order of magnitude is a generally
reasonable heuristic for a bump like this, so let's not entertain
slippery-slope arguments. I did debate between 8MB, 10MB, and 16MB, and if
e.g. 8MB would make @pitrou feel better about cache locality, I can change the
PR to that so we can move on from arguing about the workaround to implementing
a longer-term fix.
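
   For anyone hitting this before a new default lands, here is a rough sketch of a user-side workaround. It assumes the current default block size is 1 MiB (my reading of "~10x" against the 8MB-16MB candidates) and uses the conservative heuristic that one block should be larger than the longest record. The helper names `longest_line_bytes` and `suggest_block_size` are made up for illustration and are not part of Arrow.

```python
def longest_line_bytes(path):
    """Return the byte length of the longest line in a newline-delimited JSON file."""
    longest = 0
    with open(path, "rb") as f:
        for line in f:
            longest = max(longest, len(line))
    return longest


def suggest_block_size(longest, floor=1 << 20):
    """Smallest power of two that is >= floor and strictly larger than longest.

    floor defaults to 1 MiB, which is assumed (not confirmed here) to match
    the reader's current default block size.
    """
    size = floor
    while size <= longest:
        size *= 2
    return size


# The record length reported in ARROW-9612 (2097323) is just over 2 MiB,
# so this heuristic lands on 4 MiB.
size = suggest_block_size(2097323)

# Hypothetical usage with pyarrow (not executed here):
#   import pyarrow.json as pj
#   table = pj.read_json(path, read_options=pj.ReadOptions(block_size=size))
```

   This only helps people who already know the option exists, which is why I still think the default itself should move.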

