saulpw commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171415516

> In which datasets? Unless there are hundreds/thousands of fields per object, it doesn't sound that likely.

Per the original issue ARROW-9612:

> [abhishek bharani] I got this issue when json record was on single line with col length 2097323.

> [Pere-Lluís Huguet Cabot] I have added a file that prompts the same error. It is a dump of wikipedia abstracts with wikidata information.

So we don't have any "real" data on distribution of sizes, but we now have 3 people who encountered this error and felt enough friction to post an issue on JIRA (or submit an actual PR in my case). Given what I know about how few people engage with OSS vs just suck it up and feel a bit worse about the project, this is significant enough for me to say "let's bump the default ~10x" as an immediate workaround. One order of magnitude is a generally reasonable heuristic to bump in cases like this, so let's not entertain slippery slope arguments.

I did debate myself a bit about 8MB, 10MB, and 16MB, and if e.g. 8MB would make @pitrou feel better about cache locality, I can change the PR to that so we can move on from arguing about the workaround to implementing a longer-term fix.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
