[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles
[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016200#comment-17016200 ]

ShivaKumar SS commented on SPARK-26570:
---------------------------------------

Unfortunately I have hit the same issue.

[https://stackoverflow.com/questions/59757268/spark-read-multiple-column-partitioned-csv-files-out-of-memory]

Is there any workaround for this?

> Out of memory when InMemoryFileIndex bulkListLeafFiles
> ------------------------------------------------------
>
>                 Key: SPARK-26570
>                 URL: https://issues.apache.org/jira/browse/SPARK-26570
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: deshanxiao
>            Priority: Major
>         Attachments: image-2019-10-13-18-41-22-090.png,
> image-2019-10-13-18-45-33-770.png, image-2019-10-14-10-00-27-361.png,
> image-2019-10-14-10-32-17-949.png, image-2019-10-14-10-47-47-684.png,
> image-2019-10-14-10-50-47-567.png, image-2019-10-14-10-51-28-374.png,
> screenshot-1.png
>
>
> *bulkListLeafFiles* collects all file statuses in memory for every query,
> which may cause an OOM on the driver. I hit this problem on Spark 2.3.2.
> The latest version may have the same problem.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967947#comment-16967947 ]

Gautam Pulla commented on SPARK-26570:
--------------------------------------

I'm hitting a similar issue, but at a more specific line of code; perhaps the following information will help.

When enumerating a large number of files (2-3 million) in S3, we see an OOM at the line of code below, which concatenates all file paths into a single string in order to log them. With 2-3 million paths of about 150 characters each, the string would be roughly 400 million characters, and as a JVM string with multiple bytes per character that can be close to 1 GB of memory just to log the string.

With fewer files, say 400k, the concatenation and logging succeed, but it is still an annoyance because the output log contains a huge string with all the enumerated paths.

{code:java}
class InMemoryFileIndex(
...
  private[sql] def bulkListLeafFiles(
      paths: Seq[Path],
      hadoopConf: Configuration,
      filter: PathFilter,
      sparkSession: SparkSession): Seq[(Path, Seq[FileStatus])] = {

    // Short-circuits parallel listing when serial listing is likely to be faster.
    if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
      return paths.map { path =>
        (path, listLeafFiles(path, hadoopConf, filter, Some(sparkSession)))
      }
    }

    logInfo(s"Listing leaf files and directories in parallel under: ${paths.mkString(", ")}") // <<< log line printing all files
{code}

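The size estimate in the comment above can be reproduced with a short Scala sketch. It is an illustration only: the object and both helper names (`estimateJoinedBytes`, `summarizePaths`) are hypothetical and not part of Spark, and the 2 bytes per character figure is an assumption matching a UTF-16 JVM string. The second helper shows one possible way to keep such a log message bounded by printing only the first few paths and a count; it is not presented as the fix adopted by the project.

```scala
// Sketch only: these helpers are hypothetical and are not Spark APIs.
object PathLogSketch {

  /** Approximate bytes held by the characters of `numPaths` paths joined
    * with ", ", assuming 2 bytes per char (UTF-16 string storage). */
  def estimateJoinedBytes(numPaths: Long, avgPathChars: Long): Long = {
    val separatorChars = 2L * math.max(numPaths - 1L, 0L) // ", " between paths
    val totalChars = numPaths * avgPathChars + separatorChars
    totalChars * 2L
  }

  /** Builds a bounded message: at most `maxShown` paths plus a count of the rest. */
  def summarizePaths(paths: Seq[String], maxShown: Int = 10): String = {
    val shown = paths.take(maxShown).mkString(", ")
    val hidden = paths.size - math.min(paths.size, maxShown)
    if (hidden > 0) s"$shown, ... ($hidden more, ${paths.size} total)" else shown
  }

  def main(args: Array[String]): Unit = {
    // 2.5 million paths of 150 characters each, as in the report above.
    val bytes = estimateJoinedBytes(numPaths = 2500000L, avgPathChars = 150L)
    println(s"approx. ${bytes / 1000000L} MB for the joined string")

    // Made-up example paths; only the first 10 are included in the message.
    val examplePaths = (1 to 25).map(i => s"s3://bucket/part=$i/file.csv")
    println(summarizePaths(examplePaths))
  }
}
```

With these inputs the character data alone comes to several hundred megabytes, before counting any temporary copies made while the string is being built, which is consistent with the "close to 1 GB" figure reported.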
[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951437#comment-16951437 ]

fengchaoge commented on SPARK-26570:
------------------------------------

It's just a snapshot of one instant; the usage actually keeps growing until memory overflows. Any information from the production environment can be exported.

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951223#comment-16951223 ]

L. C. Hsieh commented on SPARK-26570:
-------------------------------------

At first, from the stack trace, I thought you might have too many SerializableFileStatus objects, which could cause an OOM when they are transformed back to Status. The PR was created for that.

But from the jmap logs in your latest posts, it looks like the SerializableFileStatus objects do not hold much memory. Are you sure SerializableFileStatus is the cause of the OOM?

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950686#comment-16950686 ]

Sean R. Owen commented on SPARK-26570:
--------------------------------------

That doesn't help much, especially as it's a screenshot. They're 67 bytes each, and 7 of them can't be large. How about a histogram of what's on the heap?

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950675#comment-16950675 ]

fengchaoge commented on SPARK-26570:
------------------------------------

!image-2019-10-14-10-51-28-374.png!

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950672#comment-16950672 ]

fengchaoge commented on SPARK-26570:
------------------------------------

!image-2019-10-14-10-48-05-063.png!

!image-2019-10-14-10-48-16-632.png!

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950658#comment-16950658 ]

fengchaoge commented on SPARK-26570:
------------------------------------

Hello [~srowen] [~viirya], the jmap log is shown below:

!image-2019-10-14-10-00-27-361.png!

The dump log is too big; I will upload it later.

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950406#comment-16950406 ]

L. C. Hsieh commented on SPARK-26570:
-------------------------------------

I think the number of collected file statuses can be more than 70K, considering that bulkListLeafFiles lists leaf files under the paths recursively.

Yes, a heap dump would be more helpful.

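To illustrate why a recursive listing can return far more entries than the number of input paths, here is a minimal Scala sketch. It uses plain `java.io.File` recursion on a temporary local directory as a stand-in and is not Spark's listing code; `listLeaves` and `buildLayout` are hypothetical helpers, and the partition layout is made up for the example.

```scala
import java.io.File
import java.nio.file.Files

// Illustration only: local-filesystem recursion standing in for a recursive
// leaf-file listing. None of these helpers are Spark APIs.
object LeafListingSketch {

  /** Recursively collects every regular file under `path`. */
  def listLeaves(path: File): Seq[File] =
    if (path.isDirectory) {
      // listFiles() can return null on I/O errors, so wrap it in Option.
      Option(path.listFiles()).map(_.toSeq).getOrElse(Seq.empty).flatMap(listLeaves)
    } else if (path.isFile) {
      Seq(path)
    } else {
      Seq.empty
    }

  /** Creates `partitions` directories with `filesPerPartition` empty files each
    * under a fresh temp directory (left on disk; delete it if that matters). */
  def buildLayout(partitions: Int, filesPerPartition: Int): File = {
    val root = Files.createTempDirectory("leaf-listing-sketch").toFile
    for (p <- 0 until partitions) {
      val dir = new File(root, s"part=$p")
      dir.mkdirs()
      for (f <- 0 until filesPerPartition) {
        new File(dir, s"file-$f.csv").createNewFile()
      }
    }
    root
  }

  def main(args: Array[String]): Unit = {
    val root = buildLayout(partitions = 3, filesPerPartition = 4)
    val inputPaths = Option(root.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    val leaves = inputPaths.flatMap(listLeaves)
    println(s"input paths: ${inputPaths.size}, leaf files: ${leaves.size}")
  }
}
```

The number of collected entries grows with the files under each path rather than with the number of paths, so a job with about 70,000 input paths could collect many more file statuses than that.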
[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950370#comment-16950370 ]

Sean R. Owen commented on SPARK-26570:
--------------------------------------

Not sure; those objects aren't large, and 70K of them doesn't seem like a lot. The method here is meant to return all of their statuses. Is it possible something else is eating up most of the driver memory? A heap dump would be more informative.

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950276#comment-16950276 ]

fengchaoge commented on SPARK-26570:
------------------------------------

Hello [~deshanxiao] [~srowen], Spark 2.4.3 may have the same problem. Our SQL program runs stably on Spark 2.1.0, generating about 70,000 tasks. After migrating to Spark 2.4.3, the driver runs out of memory right away. The driver log shows: java.lang.OutOfMemoryError: GC overhead limit exceeded. The jstack log is shown below; the cause may be all the serializable file statuses being collected.

!image-2019-10-13-18-41-22-090.png!

!image-2019-10-13-18-45-33-770.png!

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782137#comment-16782137 ]

Sean Owen commented on SPARK-26570:
-----------------------------------

How big can these be? Are you saying they're large, or that they leak?

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739083#comment-16739083 ]

deshanxiao commented on SPARK-26570:
------------------------------------

[~hyukjin.kwon] OK, I will try it. Thank you!

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738955#comment-16738955 ]

Hyukjin Kwon commented on SPARK-26570:
--------------------------------------

Would you be able to test this in a newer version of Spark?

[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737061#comment-16737061 ]

deshanxiao commented on SPARK-26570:
------------------------------------

!screenshot-1.png!