[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2020-01-15 Thread ShivaKumar SS (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016200#comment-17016200
 ] 

ShivaKumar SS commented on SPARK-26570:
---

Unfortunately I have hit the same issue. 

[https://stackoverflow.com/questions/59757268/spark-read-multiple-column-partitioned-csv-files-out-of-memory]

Is there any workaround for this?

 

> Out of memory when InMemoryFileIndex bulkListLeafFiles
> --
>
> Key: SPARK-26570
> URL: https://issues.apache.org/jira/browse/SPARK-26570
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2
> Reporter: deshanxiao
> Priority: Major
> Attachments: image-2019-10-13-18-41-22-090.png, 
> image-2019-10-13-18-45-33-770.png, image-2019-10-14-10-00-27-361.png, 
> image-2019-10-14-10-32-17-949.png, image-2019-10-14-10-47-47-684.png, 
> image-2019-10-14-10-50-47-567.png, image-2019-10-14-10-51-28-374.png, 
> screenshot-1.png
>
>
> *bulkListLeafFiles* collects every file status in memory for each query, 
> which may cause an OOM on the driver. I hit the problem on Spark 2.3.2. 
> The latest version may have the problem too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-11-05 Thread Gautam Pulla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967947#comment-16967947
 ] 

Gautam Pulla commented on SPARK-26570:
--

I'm hitting a similar issue, but at a more specific line of code; perhaps the 
following information will help.

When enumerating a large number of files (2-3 million) in S3, we see an OOM at 
the line of code below, which concatenates all file paths into a single string 
to log them. With 2-3 million paths of about 150 chars each, the string would 
be roughly 400 million characters; stored as a JVM string at two bytes per 
char (UTF-16), that can be close to 1 GB of memory just to log the string. 
With fewer files, say 400k, the concatenation and logging succeed, but it's 
still an annoyance because the output log contains a huge string with all the 
enumerated paths. 
{code:scala}
class InMemoryFileIndex(

...

private[sql] def bulkListLeafFiles(
    paths: Seq[Path],
    hadoopConf: Configuration,
    filter: PathFilter,
    sparkSession: SparkSession): Seq[(Path, Seq[FileStatus])] = {

  // Short-circuits parallel listing when serial listing is likely to be faster.
  if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
    return paths.map { path =>
      (path, listLeafFiles(path, hadoopConf, filter, Some(sparkSession)))
    }
  }

  // <<< Log line printing all files
  logInfo(s"Listing leaf files and directories in parallel under: ${paths.mkString(", ")}")
{code}
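
To make the arithmetic above concrete, and to show one way such a log line could be bounded, here is a minimal Scala sketch. It is not the Spark code or an actual fix from this ticket; the object and method names (BoundedPathLogging, estimatedBytes, summarize) are hypothetical, and the estimate assumes two bytes per char and ignores the ", " separators.

```scala
object BoundedPathLogging {
  // Rough heap footprint of one concatenated string, assuming 2 bytes per
  // char (UTF-16); separators between paths are ignored.
  def estimatedBytes(pathCount: Long, charsPerPath: Long): Long =
    pathCount * charsPerPath * 2L

  // Hypothetical helper: renders at most `limit` paths and only counts the rest,
  // so the message size no longer grows with the number of paths.
  def summarize(paths: Seq[String], limit: Int = 10): String = {
    val shownPaths = paths.take(limit)
    val hidden = paths.size - shownPaths.size
    val shown = shownPaths.mkString(", ")
    if (hidden > 0) s"$shown, ... ($hidden more)" else shown
  }

  def main(args: Array[String]): Unit = {
    // 2 million and 3 million paths of 150 chars each.
    println(estimatedBytes(2000000L, 150L)) // 600000000 bytes
    println(estimatedBytes(3000000L, 150L)) // 900000000 bytes
    val paths = (1 to 25).map(i => s"s3://bucket/part=$i/file.csv")
    println(summarize(paths, limit = 3))
  }
}
```

The two estimates bracket the "close to 1 GB" figure quoted above. Capping the number of rendered paths keeps the log readable whatever the path count.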




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-14 Thread fengchaoge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951437#comment-16951437
 ] 

fengchaoge commented on SPARK-26570:


That is only a snapshot of one instant; usage actually keeps growing until 
memory overflows. Any information you need from the production environment can 
be exported.




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-14 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951223#comment-16951223
 ] 

L. C. Hsieh commented on SPARK-26570:
-

At first, from the stack trace, I thought you might have too many 
SerializableFileStatus objects, which could cause an OOM when transforming 
them back to FileStatus. The PR was created for that.

But from your latest posts, the jmap logs suggest that SerializableFileStatus 
does not hold much memory. Are you sure SerializableFileStatus is the cause 
of the OOM?




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-13 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950686#comment-16950686
 ] 

Sean R. Owen commented on SPARK-26570:
--

That doesn't help much, especially as it's a screenshot. They're 67 bytes 
each, and 7 of them can't be large. How about a histogram of what's on the 
heap?
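
For reference, a class histogram like the one requested here is typically taken with the JDK's jmap tool (jmap -histo <pid>). Below is a minimal Scala sketch that wraps that command. It assumes jmap is on the PATH and that the driver's process id is already known; the object and method names are hypothetical.

```scala
import scala.sys.process._

object HeapHistogram {
  // Builds the jmap command line. With ":live" only reachable objects are
  // counted; without it, unreachable objects are included as well.
  def command(pid: Long, liveOnly: Boolean = true): Seq[String] =
    Seq("jmap", if (liveOnly) "-histo:live" else "-histo", pid.toString)

  // Runs jmap against the given JVM and keeps only the first `maxLines`
  // lines of its output (the largest classes are listed first).
  def capture(pid: Long, maxLines: Int = 30): String =
    command(pid).!!.split("\n").take(maxLines).mkString("\n")
}
```

Calling HeapHistogram.capture(driverPid) returns text that can be pasted into a comment, which is easier to search and compare than a screenshot.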




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-13 Thread fengchaoge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950675#comment-16950675
 ] 

fengchaoge commented on SPARK-26570:


!image-2019-10-14-10-51-28-374.png!




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-13 Thread fengchaoge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950672#comment-16950672
 ] 

fengchaoge commented on SPARK-26570:


!image-2019-10-14-10-48-05-063.png!

!image-2019-10-14-10-48-16-632.png!




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-13 Thread fengchaoge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950658#comment-16950658
 ] 

fengchaoge commented on SPARK-26570:


Hello [~srowen] [~viirya],

The jmap output is shown below:

!image-2019-10-14-10-00-27-361.png!

The dump is too big; I will upload it later.

 




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-13 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950406#comment-16950406
 ] 

L. C. Hsieh commented on SPARK-26570:
-

I think the number of collected file statuses can be more than 70K, since 
bulkListLeafFiles lists leaf files under the paths recursively. Yes, a heap 
dump would be more helpful.
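
To illustrate why the count can exceed the number of input paths: with recursive listing, one input directory contributes every leaf file beneath it. The sketch below uses java.nio on a local temp directory, not the Hadoop FileSystem API that Spark uses; the names and the directory layout are made up for illustration.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object LeafFileCount {
  // Recursively lists the regular files under `root`.
  def listLeafFiles(root: Path): Seq[Path] = {
    val stream = Files.walk(root)
    try stream.iterator().asScala.filter(p => Files.isRegularFile(p)).toList
    finally stream.close()
  }

  def main(args: Array[String]): Unit = {
    // One input path with three partition directories of two files each.
    val root = Files.createTempDirectory("leaf-demo")
    for (p <- 1 to 3; f <- 1 to 2) {
      val dir = Files.createDirectories(root.resolve(s"part=$p"))
      Files.createFile(dir.resolve(s"file-$f.csv"))
    }
    println(s"1 input path -> ${listLeafFiles(root).size} leaf files")
  }
}
```

The same multiplication applies per partition level, so a job with tens of thousands of input paths can end up holding far more statuses than paths.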




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-13 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950370#comment-16950370
 ] 

Sean R. Owen commented on SPARK-26570:
--

Not sure; those objects aren't large, and 70K of them doesn't seem like a lot. 
The method here is meant to return all of their statuses. Is it possible 
something else is eating up most of the driver memory? A heap dump would be 
more informative.




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-10-13 Thread fengchaoge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16950276#comment-16950276
 ] 

fengchaoge commented on SPARK-26570:


Hello [~deshanxiao] [~srowen], Spark 2.4.3 may have the same problem. Our 
SQL program runs stably on Spark 2.1.0, generating about 70,000 tasks. After 
migrating to Spark 2.4.3, the driver runs out of memory right away. The driver 
log shows: java.lang.OutOfMemoryError: GC overhead limit exceeded. The jstack 
output is shown below; the cause may be that all serializable file statuses 
are collected.

_!image-2019-10-13-18-41-22-090.png!_

!image-2019-10-13-18-45-33-770.png!

 




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782137#comment-16782137
 ] 

Sean Owen commented on SPARK-26570:
---

How big can these be? Are you saying they're large, or that they leak?





[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-01-09 Thread deshanxiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739083#comment-16739083
 ] 

deshanxiao commented on SPARK-26570:


[~hyukjin.kwon] OK, I will try it. Thank you!




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-01-09 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738955#comment-16738955
 ] 

Hyukjin Kwon commented on SPARK-26570:
--

Would you be able to test this on a newer version of Spark?




[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-01-08 Thread deshanxiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737061#comment-16737061
 ] 

deshanxiao commented on SPARK-26570:


 !screenshot-1.png! 
