[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649132#comment-15649132 ]
Nicholas Chammas commented on SPARK-18367:
------------------------------------------

On 2.0.x the caching is required due to SPARK-18254, which is fixed in 2.1+. On master I get the same "Too many open files in system" error without the caching if I remove the limit().

> limit() makes the lame walk again
> ---------------------------------
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
> Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
> I have a complex DataFrame query that fails to run normally but succeeds if I add a dummy {{limit()}} upstream in the query tree.
>
> The failure presents itself like this:
>
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: /private/var/folders/f5/t48vxz555b51mr3g6jjhxv400000gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc (Too many open files in system)
> {code}
>
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on macOS. However, I don't think that's the issue, since if I add a dummy {{limit()}} early on the query tree -- dummy as in it does not actually reduce the number of rows queried -- then the same query works.
>
> I've diffed the physical query plans to see what this {{limit()}} is actually doing, and the diff is as follows:
>
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> < : : : +- *GlobalLimit 1000000
> < : : : +- Exchange SinglePartition
> < : : : +- *LocalLimit 1000000
> < : : : +- *Project [...]
> < : : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> > : : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> 49,53c45
> < : : +- *GlobalLimit 1000000
> < : : +- Exchange SinglePartition
> < : : +- *LocalLimit 1000000
> < : : +- *Project [...]
> < : : +- *Scan orc [...] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> ---
> > : : +- *Scan orc [] Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<...
> {code}
>
> Does this give any clues as to why this {{limit()}} is helping? Again, the 1000000 limit you can see in the plan is much higher than the cardinality of the dataset I'm reading, so there is no theoretical impact on the output. You can see the full query plans attached to this ticket.
>
> Unfortunately, I don't have a minimal reproduction for this issue, but I can work towards one with some clues.
>
> I'm seeing this behavior on 2.0.1 and on master at commit {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.
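To make the workaround concrete, here is a minimal PySpark sketch of what "adding a dummy {{limit()}} upstream" looks like; the ORC path, column name, and downstream query are illustrative placeholders, not details taken from this ticket.

{code}
# Minimal sketch of the dummy-limit() workaround, under assumed placeholder
# paths and columns. The limit value is deliberately far above the dataset's
# cardinality, so it does not change the query's output; it only changes the
# physical plan (adding GlobalLimit / Exchange SinglePartition / LocalLimit
# nodes above the scan, as seen in the plan diff above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-18367-workaround").getOrCreate()

df = spark.read.orc("file:/path/to/data.orc")    # placeholder path

df_limited = df.limit(1000000)                   # "dummy" limit, no effect on the rows returned

# ... build the rest of the complex query on df_limited instead of df ...
result = df_limited.groupBy("some_key").count()  # placeholder downstream query

result.explain()  # compare the physical plan with and without the limit()
{code}

Whether this reliably avoids the "Too many open files" error presumably depends on how the added Exchange SinglePartition changes the shuffle-file fan-out; it reflects the reporter's observation, not a guaranteed fix.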