[jira] [Resolved] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewan Higgs resolved SPARK-21817.
--------------------------------
    Resolution: Invalid

This was caused by a change in a stable/evolving interface which previously accepted null. The interface should continue to accept null, so this will be fixed on the HDFS side in HDFS-12344.
[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138905#comment-16138905 ]

Ewan Higgs commented on SPARK-21817:
------------------------------------

{quote}
Ewan: do a patch there with a new test method (where?) & I'll review it.
{quote}

Sure. Sorry for the bug report on Spark, all. I'll fix it in HDFS.
[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138253#comment-16138253 ]

Ewan Higgs commented on SPARK-21817:
------------------------------------

{quote}
Can this be accomplished with a change that's still compatible with 2.6?
{quote}

Yes, I believe it should just work. The permission argument already exists in the constructor call; {{InMemoryFileIndex}} is simply passing {{null}} at the moment.
[jira] [Updated] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewan Higgs updated SPARK-21817:
-------------------------------
    Attachment: SPARK-21817.001.patch

Attaching a simple fix that avoids the NPE when listing files against Hadoop trunk.
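The patch itself isn't inlined in this thread. As a rough sketch of the change (not the attached patch; the helper methods and surrounding code are illustrative), the fix amounts to forwarding the listed {{FileStatus}}'s permission, owner, and group instead of {{null}} when constructing each {{LocatedFileStatus}}:

{code}
import org.apache.hadoop.fs.{BlockLocation, FileStatus, LocatedFileStatus}

// Before (roughly what InMemoryFileIndex does today): permission, owner and
// group are passed as null, which HDFS-6984-based code dereferences.
def toLocatedStatusOld(f: FileStatus, locations: Array[BlockLocation]): LocatedFileStatus =
  new LocatedFileStatus(f.getLen, f.isDirectory, f.getReplication, f.getBlockSize,
    f.getModificationTime, 0L, null, null, null, null, f.getPath, locations)

// After: forward the real access time, permission, owner and group from the
// listed FileStatus.
def toLocatedStatusNew(f: FileStatus, locations: Array[BlockLocation]): LocatedFileStatus =
  new LocatedFileStatus(f.getLen, f.isDirectory, f.getReplication, f.getBlockSize,
    f.getModificationTime, f.getAccessTime, f.getPermission, f.getOwner, f.getGroup,
    null /* symlink */, f.getPath, locations)
{code}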
[jira] [Created] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
Ewan Higgs created SPARK-21817:
-----------------------------------

             Summary: Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
                 Key: SPARK-21817
                 URL: https://issues.apache.org/jira/browse/SPARK-21817
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Ewan Higgs
             Fix For: 2.3.0


The implementation of HDFS-6984 now uses the passed-in {{FsPermission}} to pull out the ACL and other information. Passing {{null}} is therefore no longer adequate and causes an NPE when listing files.
[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157105#comment-15157105 ]

Ewan Higgs commented on SPARK-13434:
------------------------------------

Hi Sean,

This is using jmap with the {{-histo:live}} argument which, I thought, reports live objects only. If it's dumping non-live objects too and you have a way to check only live objects, let me know and I'll be happy to re-run the job.

{quote}
I'm missing what you're proposing – what is the opportunity to reduce memory usage?
{quote}

I'm trying to track the outstanding work from the GitHub issue. [~josephkb] suggests there:

{quote}
For 3, I should have been more specific. Tungsten makes improvements on DataFrames, so it should improve the performance of simple ML Pipeline operations like feature transformation and prediction. However, to get the same benefits for model training, we'll need to rewrite the algorithms to use DataFrames and not RDDs. Future work...
{quote}

So one proposal is to reimplement RandomForest in terms of DataFrames.

Aside from that, many of the objects in the heap histogram are small and suffer from per-object JVM overhead. E.g. {{Predict}} is a pair of doubles yet it consumes 32 bytes; in a native runtime it could be 8 bytes (a pair of floats). {{Node}} consumes 52 bytes, but it looks like it should be possible to fit it in 41 bytes (int + (float, float) + float + bool + ptr + ptr + ptr). See the sketch at the end of this comment.

Another issue is concurrency. If there are multiple threads working within an executor, all building trees, then they are all consuming memory at the same time. This is a common issue in R when using papply. Reducing the concurrency can help reduce memory pressure.
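To make the overhead point concrete, here is a purely hypothetical sketch (not code from any patch; all names are made up) of a flattened, array-based tree encoding that drops the per-node object headers and uses floats where doubles aren't needed:

{code}
// Hypothetical flat encoding: one set of parallel primitive arrays per tree
// instead of one Node/Predict/InformationGainStats object per tree node.
// Continuous splits only, for brevity.
final class FlatTree(numNodes: Int) {
  val featureIndex = new Array[Int](numNodes)   // split feature; -1 marks a leaf
  val threshold    = new Array[Float](numNodes) // split threshold
  val prediction   = new Array[Float](numNodes) // Predict.predict
  val probability  = new Array[Float](numNodes) // Predict.prob
  val leftChild    = new Array[Int](numNodes)   // array indices instead of pointers
  val rightChild   = new Array[Int](numNodes)

  def predict(features: Array[Double]): Float = {
    var i = 0
    while (featureIndex(i) >= 0) {              // walk down until we hit a leaf
      i = if (features(featureIndex(i)) <= threshold(i)) leftChild(i) else rightChild(i)
    }
    prediction(i)
  }
}
{code}

That is 24 bytes per node across six primitive arrays, versus the 52 bytes for {{Node}} plus 32 bytes for {{Predict}} alone, before counting the {{scala.Some}} and {{InformationGainStats}} wrappers in the histogram.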
[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156921#comment-15156921 ]

Ewan Higgs commented on SPARK-13434:
------------------------------------

SPARK-3728 has a similar title, but its description immediately sets out to discuss writing data to disk to handle out-of-memory data. This ticket is focused on reducing the memory used in the first place.
[jira] [Updated] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewan Higgs updated SPARK-13434:
-------------------------------
    Attachment: heap-usage.log

Heap usage of RandomForest sampled with {{jmap -histo:live <pid>}} every 5 seconds.
[jira] [Updated] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewan Higgs updated SPARK-13434:
-------------------------------
    Attachment: rf-heap-usage.png

JConsole output of memory use with the 1.3G input file.
[jira] [Created] (SPARK-13434) Reduce Spark RandomForest memory footprint
Ewan Higgs created SPARK-13434:
-----------------------------------

             Summary: Reduce Spark RandomForest memory footprint
                 Key: SPARK-13434
                 URL: https://issues.apache.org/jira/browse/SPARK-13434
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.6.0
         Environment: Linux
            Reporter: Ewan Higgs


The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking game on GitHub (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.

Using Spark 1.6, a user of mine is running into problems running RandomForest training on largish datasets on machines with 64G memory and the following in {{spark-defaults.conf}}:

{code}
spark.executor.cores 2
spark.executor.instances 199
spark.executor.memory 10240M
{code}

I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell --driver-memory 30G --executor-memory 30G}} and took a heap profile on a single machine by running {{jmap -histo:live <pid>}}. I took a sample every 5 seconds and at the peak it looks like this:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:      5428073     8458773496  [D
   2:     12293653     4124641992  [I
   3:     32508964     1820501984  org.apache.spark.mllib.tree.model.Node
   4:     53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
   5:     72853787     1165660592  scala.Some
   6:     16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
   7:        72969      390492744  [B
   8:      3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
   9:      3754500      120144000  scala.collection.immutable.HashMap$HashMap1
  10:      3318349      106187168  org.apache.spark.mllib.tree.model.Split
  11:      3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
  12:      3764745       60235920  java.lang.Integer
  13:      3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
  14:       380804       45361144  [C
  15:       268887       34877128
  16:       268887       34431568
  17:       908377       34042760  [Lscala.collection.immutable.HashMap;
  18:          110           2640  org.apache.spark.mllib.regression.LabeledPoint
  19:          110           2640  org.apache.spark.mllib.linalg.SparseVector
  20:        20206       25979864
  21:          100           2400  org.apache.spark.mllib.tree.impl.TreePoint
  22:          100           2400  org.apache.spark.mllib.tree.impl.BaggedPoint
  23:       908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
  24:        20206       20158864
  25:        17023       14380352
  26:           16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
  27:       445797       10699128  scala.Tuple2
{code}
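A spark-shell session of roughly this shape reproduces the behaviour (a sketch only: the dataset path, label position, and training parameters below are illustrative guesses, not details from the benchmark):

{code}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Parse the benchmark CSV into LabeledPoints. Assumes a numeric label in the
// last column and 685 numeric features before it (illustrative only).
val data = sc.textFile("hdfs:///data/train-1.3G.csv").map { line =>
  val cols = line.split(',').map(_.toDouble)
  LabeledPoint(cols.last, Vectors.dense(cols.init))
}.cache()

// Training many deep trees is what drives the Node/Predict object counts.
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "sqrt",
  impurity = "entropy", // matches the EntropyAggregator entries in the histogram
  maxDepth = 20,
  maxBins = 32,
  seed = 42)
{code}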
[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files
[ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037593#comment-15037593 ]

Ewan Higgs commented on SPARK-5836:
-----------------------------------

[~tdas]
{quote}
The only case there may be issues is when the external shuffle service is used.
{quote}

I see this problematic behaviour in IPython/pyspark notebooks. We can try to go through and unpersist, checkpoint, and so on with the RDDs, but the shuffle files don't seem to go away (a sketch of what we attempt follows the quoted issue below). We see this even though we are not using the external shuffle service.

> Highlight in Spark documentation that by default Spark does not delete its temporary files
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5836
>                 URL: https://issues.apache.org/jira/browse/SPARK-5836
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Tomasz Dudziak
>            Assignee: Ilya Ganelin
>            Priority: Minor
>             Fix For: 1.3.1, 1.4.0
>
> We recently learnt the hard way (in a prod system) that Spark by default does not delete its temporary files until it is stopped. Within a relatively short time span of heavy Spark use, the disk of our prod machine filled up completely because of multiple shuffle files written to it. We think there should be better documentation around the fact that after a job is finished it leaves a lot of rubbish behind, so that this does not come as a surprise. Probably a good place to highlight that fact would be the documentation of the {{spark.local.dir}} property, which controls where Spark temporary files are written.
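The cleanup attempts mentioned above look roughly like this (a Scala sketch with made-up RDD names; the pyspark equivalent behaves the same way for us):

{code}
// Sketch: even after checkpointing and unpersisting, the shuffle files under
// spark.local.dir stay on disk until the shuffle is garbage-collected by the
// ContextCleaner or the SparkContext is stopped.
val pairs   = sc.parallelize(1 to 1000000).map(i => (i % 100, 1L))
val grouped = pairs.reduceByKey(_ + _).cache() // reduceByKey writes shuffle files

sc.setCheckpointDir("/tmp/checkpoints")        // illustrative path
grouped.checkpoint()                           // must precede the first action
grouped.count()                                // materializes cache + checkpoint
grouped.unpersist(blocking = true)             // frees cached blocks only
{code}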
[jira] [Resolved] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems
[ https://issues.apache.org/jira/browse/SPARK-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ewan Higgs resolved SPARK-5300.
-------------------------------
    Resolution: Won't Fix

I submitted a fix at the FileSystem level based on comments on the mailing list. The patch was rejected because anyone implementing a file input format is expected to make sure the files are loaded in order. They can do that by overriding the {{listStatus}} method as follows:

{code}
import java.util.{List => JList}

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.JobContext

import scala.collection.JavaConverters._

// Sort the file pieces since order matters.
override def listStatus(job: JobContext): JList[FileStatus] = {
  val listing = super.listStatus(job).asScala
  val sortedListing = listing.sortWith { (lhs, rhs) =>
    lhs.getPath.compareTo(rhs.getPath) < 0
  }
  sortedListing.asJava
}
{code}
[jira] [Commented] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems
[ https://issues.apache.org/jira/browse/SPARK-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293201#comment-14293201 ]

Ewan Higgs commented on SPARK-5300:
-----------------------------------

The PR appears to have been rejected on the grounds that any FileInputFormat that wants sorted input should sort the file parts itself. Perhaps this could be documented more clearly.
[jira] [Created] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems
Ewan Higgs created SPARK-5300:
---------------------------------

             Summary: Spark loads file partitions in inconsistent order on native filesystems
                 Key: SPARK-5300
                 URL: https://issues.apache.org/jira/browse/SPARK-5300
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 1.2.0, 1.1.0
         Environment: Linux, EXT4, for example.
            Reporter: Ewan Higgs


Discussed on the user list in April 2014:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html

And on the dev list in January 2015:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html

When using a file system which isn't HDFS, file partitions ('part-0, part-1', etc.) are not guaranteed to load in the same order. This means previously sorted RDDs will be loaded out of order.