[jira] [Resolved] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Higgs resolved SPARK-21817.

Resolution: Invalid

This was caused by a change in a stable/evolving interface which previously 
accepted null. The interface should continue to accept null, so this will be 
fixed in HDFS-12344.

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to 
> pull out the ACL and other information. Therefore, passing in {{null}} is no 
> longer adequate and causes an NPE when listing files.






[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138905#comment-16138905
 ] 

Ewan Higgs commented on SPARK-21817:


{quote}
Ewan: do a patch there with a new test method (where?) & I'll review it.
{quote}
Sure.

Sorry for the bug report on Spark, all. I'll fix it in HDFS.

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to 
> pull out the ACL and other information. Therefore, passing in {{null}} is no 
> longer adequate and causes an NPE when listing files.






[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138253#comment-16138253
 ] 

Ewan Higgs commented on SPARK-21817:


{quote}Can this be accomplished with a change that's still compatible with 
2.6?{quote}
Yes, I believe it should just work. The argument already exists in the function 
call; {{InMemoryFileIndex}} is just passing {{null}} currently.
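
For context, this is roughly the shape of the call in question (a simplified 
sketch, not the exact Spark source; {{f}} stands for the listed {{FileStatus}} 
and {{blockLocations}} for its block locations):

{code}
// Simplified sketch: InMemoryFileIndex builds a LocatedFileStatus from a plain
// FileStatus and currently fills the permission/owner/group/symlink slots
// with null.
val lfs = new LocatedFileStatus(
  f.getLen, f.isDirectory, f.getReplication, f.getBlockSize,
  f.getModificationTime, 0,
  null, null, null, null,   // permission, owner, group, symlink
  f.getPath, blockLocations)
{code}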

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to 
> pull out the ACL and other information. Therefore, passing in {{null}} is no 
> longer adequate and causes an NPE when listing files.






[jira] [Updated] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Higgs updated SPARK-21817:
---
Attachment: SPARK-21817.001.patch

Attaching a simple fix that no longer NPEs on Hadoop head.
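
The patch itself is attached rather than inlined; a minimal sketch of the idea, 
assuming the same {{LocatedFileStatus}} constructor as above ({{f}} and 
{{blockLocations}} are stand-ins), is to forward the real values instead of 
{{null}}:

{code}
// Sketch only: pass the permission, owner and group from the underlying
// FileStatus so the HDFS-6984 code has real values to derive ACL bits from.
val lfs = new LocatedFileStatus(
  f.getLen, f.isDirectory, f.getReplication, f.getBlockSize,
  f.getModificationTime, f.getAccessTime,
  f.getPermission, f.getOwner, f.getGroup,
  null,                     // symlink
  f.getPath, blockLocations)
{code}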

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
> Fix For: 2.3.0
>
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to 
> pull out the ACL and other information. Therefore, passing in {{null}} is no 
> longer adequate and causes an NPE when listing files.






[jira] [Created] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-23 Thread Ewan Higgs (JIRA)
Ewan Higgs created SPARK-21817:
--

 Summary: Pass FSPermissions to LocatedFileStatus from 
InMemoryFileIndex
 Key: SPARK-21817
 URL: https://issues.apache.org/jira/browse/SPARK-21817
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ewan Higgs
 Fix For: 2.3.0


The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to pull 
out the ACL and other information. Therefore, passing in {{null}} is no longer 
adequate and causes an NPE when listing files.






[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint

2016-02-22 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157105#comment-15157105
 ] 

Ewan Higgs commented on SPARK-13434:


Hi Sean,
This is using jmap with the {{-histo:live}} argument, which I thought reports 
live objects only. If it's dumping non-live objects too, and you have a way to 
check only live objects, let me know and I'll be happy to re-run the job.

{quote}
I'm missing what you're proposing – what is the opportunity to reduce memory 
usage?
{quote}

I'm trying to track the outstanding work from the GitHub issue. [~josephkb] 
suggested there:

{quote}
For 3, I should have been more specific. Tungsten makes improvements on 
DataFrames, so it should improve the performance of simple ML Pipeline 
operations like feature transformation and prediction. However, to get the same 
benefits for model training, we'll need to rewrite the algorithms to use 
DataFrames and not RDDs. Future work...
{quote}

So one proposal is to reimplement RF in terms of DataFrames.

Aside from that, I do see that many of the objects are small and suffer from 
JVM overhead. For example, {{Predict}} is a pair of doubles yet it consumes 32 
bytes; in a native runtime it could be 8 bytes (a pair of floats). {{Node}} 
consumes 52 bytes, but it looks like it should be possible to fit it in 41 
bytes (int + (float, float) + float + bool + ptr + ptr + ptr).
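
To make the header overhead concrete, here is a purely illustrative sketch (not 
MLlib API) of the kind of packed layout that would bring the per-prediction 
cost from ~32 bytes down to ~8:

{code}
// Illustrative only: keep the predictions of all nodes of a tree in two
// parallel Float arrays instead of one Predict object per node. This avoids
// the per-object header and halves the field width (Double -> Float).
final class PackedPredicts(numNodes: Int) {
  private val predict = new Array[Float](numNodes)
  private val prob    = new Array[Float](numNodes)

  def set(node: Int, p: Float, pr: Float): Unit = {
    predict(node) = p
    prob(node) = pr
  }

  def predictOf(node: Int): Float = predict(node)
  def probOf(node: Int): Float = prob(node)
}
{code}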

Another issue is concurrency. If there are multiple threads working within an 
executor, each building trees, then they are all consuming memory at the same 
time. This is a common issue in R when using papply. Reducing the concurrency 
can help reduce memory pressure.
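
For instance, one way to cap that concurrency (values are illustrative; the 
right numbers depend on the cluster) is to give each executor a single task 
slot:

{code}
import org.apache.spark.SparkConf

// Illustrative: with one core per executor, only one tree-building task runs
// per JVM at a time, trading throughput for lower peak memory.
val conf = new SparkConf()
  .set("spark.executor.cores", "1")
  .set("spark.executor.instances", "199")
  .set("spark.executor.memory", "10240M")
{code}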

> Reduce Spark RandomForest memory footprint
> --
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Ewan Higgs
>  Labels: decisiontree, mllib, randomforest
> Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in a user's benchmarking game on GitHub 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the 
> RandomForest training on largish datasets on machines with 64G memory and the 
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and obtained a heap profile from a 
> single machine by running {{jmap -histo:live <pid>}}. I took a sample 
> every 5 seconds and at the peak it looks like this:
> {code}
>  num #instances #bytes  class name
> --
>1:   5428073 8458773496  [D
>2:  12293653 4124641992  [I
>3:  32508964 1820501984  org.apache.spark.mllib.tree.model.Node
>4:  53068426 1698189632  org.apache.spark.mllib.tree.model.Predict
>5:  72853787 1165660592  scala.Some
>6:  16263408  910750848  
> org.apache.spark.mllib.tree.model.InformationGainStats
>7: 72969  390492744  [B
>8:   3327008  133080320  
> org.apache.spark.mllib.tree.impl.DTStatsAggregator
>9:   3754500  120144000  
> scala.collection.immutable.HashMap$HashMap1
>   10:   3318349  106187168  org.apache.spark.mllib.tree.model.Split
>   11:   3534946   84838704  
> org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:   3764745   60235920  java.lang.Integer
>   13:   3327008   53232128  
> org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:380804   45361144  [C
>   15:268887   34877128  
>   16:268887   34431568  
>   17:908377   34042760  [Lscala.collection.immutable.HashMap;
>   18:   110   2640  
> org.apache.spark.mllib.regression.LabeledPoint
>   19:   110   2640  org.apache.spark.mllib.linalg.SparseVector
>   20: 20206   25979864  
>   21:   100   2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:   100   2400  
> org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:908332   21799968  
> scala.collection.immutable.HashMap$HashTrieMap
>   24: 20206   20158864  
>   25: 17023   14380352  
>   26:16   13308288  
> [Lorg.apach

[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint

2016-02-22 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156921#comment-15156921
 ] 

Ewan Higgs commented on SPARK-13434:


SPARK-3728 has a similar title, but its description immediately sets out to 
discuss writing data to disk to handle data that doesn't fit in memory.

This ticket is focused instead on reducing the amount of memory used.

> Reduce Spark RandomForest memory footprint
> --
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Ewan Higgs
>  Labels: decisiontree, mllib, randomforest
> Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in a user's benchmarking game on GitHub 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the 
> RandomForest training on largish datasets on machines with 64G memory and the 
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and obtained a heap profile from a 
> single machine by running {{jmap -histo:live <pid>}}. I took a sample 
> every 5 seconds and at the peak it looks like this:
> {code}
>  num #instances #bytes  class name
> --
>1:   5428073 8458773496  [D
>2:  12293653 4124641992  [I
>3:  32508964 1820501984  org.apache.spark.mllib.tree.model.Node
>4:  53068426 1698189632  org.apache.spark.mllib.tree.model.Predict
>5:  72853787 1165660592  scala.Some
>6:  16263408  910750848  
> org.apache.spark.mllib.tree.model.InformationGainStats
>7: 72969  390492744  [B
>8:   3327008  133080320  
> org.apache.spark.mllib.tree.impl.DTStatsAggregator
>9:   3754500  120144000  
> scala.collection.immutable.HashMap$HashMap1
>   10:   3318349  106187168  org.apache.spark.mllib.tree.model.Split
>   11:   3534946   84838704  
> org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:   3764745   60235920  java.lang.Integer
>   13:   3327008   53232128  
> org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:380804   45361144  [C
>   15:268887   34877128  
>   16:268887   34431568  
>   17:908377   34042760  [Lscala.collection.immutable.HashMap;
>   18:   110   2640  
> org.apache.spark.mllib.regression.LabeledPoint
>   19:   110   2640  org.apache.spark.mllib.linalg.SparseVector
>   20: 20206   25979864  
>   21:   100   2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:   100   2400  
> org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:908332   21799968  
> scala.collection.immutable.HashMap$HashTrieMap
>   24: 20206   20158864  
>   25: 17023   14380352  
>   26:16   13308288  
> [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:445797   10699128  scala.Tuple2
> {code}






[jira] [Updated] (SPARK-13434) Reduce Spark RandomForest memory footprint

2016-02-22 Thread Ewan Higgs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Higgs updated SPARK-13434:
---
Attachment: heap-usage.log

Heap usage of RandomForest sampled with {{jmap -histo:live <pid>}} every 
5 seconds.

> Reduce Spark RandomForest memory footprint
> --
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Ewan Higgs
>  Labels: decisiontree, mllib, randomforest
> Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in a user's benchmarking game on GitHub 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the 
> RandomForest training on largish datasets on machines with 64G memory and the 
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and obtained a heap profile from a 
> single machine by running {{jmap -histo:live <pid>}}. I took a sample 
> every 5 seconds and at the peak it looks like this:
> {code}
>  num #instances #bytes  class name
> --
>1:   5428073 8458773496  [D
>2:  12293653 4124641992  [I
>3:  32508964 1820501984  org.apache.spark.mllib.tree.model.Node
>4:  53068426 1698189632  org.apache.spark.mllib.tree.model.Predict
>5:  72853787 1165660592  scala.Some
>6:  16263408  910750848  
> org.apache.spark.mllib.tree.model.InformationGainStats
>7: 72969  390492744  [B
>8:   3327008  133080320  
> org.apache.spark.mllib.tree.impl.DTStatsAggregator
>9:   3754500  120144000  
> scala.collection.immutable.HashMap$HashMap1
>   10:   3318349  106187168  org.apache.spark.mllib.tree.model.Split
>   11:   3534946   84838704  
> org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:   3764745   60235920  java.lang.Integer
>   13:   3327008   53232128  
> org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:380804   45361144  [C
>   15:268887   34877128  
>   16:268887   34431568  
>   17:908377   34042760  [Lscala.collection.immutable.HashMap;
>   18:   110   2640  
> org.apache.spark.mllib.regression.LabeledPoint
>   19:   110   2640  org.apache.spark.mllib.linalg.SparseVector
>   20: 20206   25979864  
>   21:   100   2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:   100   2400  
> org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:908332   21799968  
> scala.collection.immutable.HashMap$HashTrieMap
>   24: 20206   20158864  
>   25: 17023   14380352  
>   26:16   13308288  
> [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:445797   10699128  scala.Tuple2
> {code}






[jira] [Updated] (SPARK-13434) Reduce Spark RandomForest memory footprint

2016-02-22 Thread Ewan Higgs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Higgs updated SPARK-13434:
---
Attachment: rf-heap-usage.png

JConsole output of memory use with the 1.3G input file.

> Reduce Spark RandomForest memory footprint
> --
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Ewan Higgs
>  Labels: decisiontree, mllib, randomforest
> Attachments: rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in a user's benchmarking game on GitHub 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the 
> RandomForest training on largish datasets on machines with 64G memory and the 
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and obtained a heap profile from a 
> single machine by running {{jmap -histo:live <pid>}}. I took a sample 
> every 5 seconds and at the peak it looks like this:
> {code}
>  num #instances #bytes  class name
> --
>1:   5428073 8458773496  [D
>2:  12293653 4124641992  [I
>3:  32508964 1820501984  org.apache.spark.mllib.tree.model.Node
>4:  53068426 1698189632  org.apache.spark.mllib.tree.model.Predict
>5:  72853787 1165660592  scala.Some
>6:  16263408  910750848  
> org.apache.spark.mllib.tree.model.InformationGainStats
>7: 72969  390492744  [B
>8:   3327008  133080320  
> org.apache.spark.mllib.tree.impl.DTStatsAggregator
>9:   3754500  120144000  
> scala.collection.immutable.HashMap$HashMap1
>   10:   3318349  106187168  org.apache.spark.mllib.tree.model.Split
>   11:   3534946   84838704  
> org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:   3764745   60235920  java.lang.Integer
>   13:   3327008   53232128  
> org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:380804   45361144  [C
>   15:268887   34877128  
>   16:268887   34431568  
>   17:908377   34042760  [Lscala.collection.immutable.HashMap;
>   18:   110   2640  
> org.apache.spark.mllib.regression.LabeledPoint
>   19:   110   2640  org.apache.spark.mllib.linalg.SparseVector
>   20: 20206   25979864  
>   21:   100   2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:   100   2400  
> org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:908332   21799968  
> scala.collection.immutable.HashMap$HashTrieMap
>   24: 20206   20158864  
>   25: 17023   14380352  
>   26:16   13308288  
> [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:445797   10699128  scala.Tuple2
> {code}






[jira] [Created] (SPARK-13434) Reduce Spark RandomForest memory footprint

2016-02-22 Thread Ewan Higgs (JIRA)
Ewan Higgs created SPARK-13434:
--

 Summary: Reduce Spark RandomForest memory footprint
 Key: SPARK-13434
 URL: https://issues.apache.org/jira/browse/SPARK-13434
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.6.0
 Environment: Linux
Reporter: Ewan Higgs


The RandomForest implementation can easily run out of memory on moderate 
datasets. This was raised in a user's benchmarking game on GitHub 
(https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was 
a tracking issue, but I couldn't find one.

Using Spark 1.6, a user of mine is running into problems running the 
RandomForest training on largish datasets on machines with 64G memory and the 
following in {{spark-defaults.conf}}:

{code}
spark.executor.cores 2
spark.executor.instances 199
spark.executor.memory 10240M
{code}

I reproduced the excessive memory use from the benchmark example (using an 
input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell 
--driver-memory 30G --executor-memory 30G}} and obtained a heap profile from a 
single machine by running {{jmap -histo:live <pid>}}. I took a sample 
every 5 seconds and at the peak it looks like this:

{code}
 num #instances #bytes  class name
--
   1:   5428073 8458773496  [D
   2:  12293653 4124641992  [I
   3:  32508964 1820501984  org.apache.spark.mllib.tree.model.Node
   4:  53068426 1698189632  org.apache.spark.mllib.tree.model.Predict
   5:  72853787 1165660592  scala.Some
   6:  16263408  910750848  
org.apache.spark.mllib.tree.model.InformationGainStats
   7: 72969  390492744  [B
   8:   3327008  133080320  
org.apache.spark.mllib.tree.impl.DTStatsAggregator
   9:   3754500  120144000  scala.collection.immutable.HashMap$HashMap1
  10:   3318349  106187168  org.apache.spark.mllib.tree.model.Split
  11:   3534946   84838704  
org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
  12:   3764745   60235920  java.lang.Integer
  13:   3327008   53232128  
org.apache.spark.mllib.tree.impurity.EntropyAggregator
  14:380804   45361144  [C
  15:268887   34877128  
  16:268887   34431568  
  17:908377   34042760  [Lscala.collection.immutable.HashMap;
  18:   110   2640  
org.apache.spark.mllib.regression.LabeledPoint
  19:   110   2640  org.apache.spark.mllib.linalg.SparseVector
  20: 20206   25979864  
  21:   100   2400  org.apache.spark.mllib.tree.impl.TreePoint
  22:   100   2400  org.apache.spark.mllib.tree.impl.BaggedPoint
  23:908332   21799968  
scala.collection.immutable.HashMap$HashTrieMap
  24: 20206   20158864  
  25: 17023   14380352  
  26:16   13308288  
[Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
  27:445797   10699128  scala.Tuple2
{code}






[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files

2015-12-03 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037593#comment-15037593
 ] 

Ewan Higgs commented on SPARK-5836:
---

[~tdas] 
{quote}
The only case there may be issues is when the external shuffle service is used.
{quote}

I see this problematic behaviour in IPython/PySpark notebooks. We can try to go 
through and unpersist, checkpoint, and so on with the RDDs, but the shuffle 
files don't seem to go away. We see this even though we are not using the 
external shuffle service.

> Highlight in Spark documentation that by default Spark does not delete its 
> temporary files
> --
>
> Key: SPARK-5836
> URL: https://issues.apache.org/jira/browse/SPARK-5836
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Tomasz Dudziak
>Assignee: Ilya Ganelin
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
>
> We recently learnt the hard way (in a prod system) that Spark by default does 
> not delete its temporary files until it is stopped. Within a relatively short 
> time span of heavy Spark use, the disk of our prod machine filled up 
> completely because of multiple shuffle files written to it. We think there 
> should be better documentation around the fact that after a job is finished 
> it leaves a lot of rubbish behind so that this does not come as a surprise.
> Probably a good place to highlight that fact would be the documentation of 
> {{spark.local.dir}} property, which controls where Spark temporary files are 
> written. 






[jira] [Resolved] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems

2015-04-27 Thread Ewan Higgs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Higgs resolved SPARK-5300.
---
Resolution: Won't Fix

I submitted a fix at the FileSystem level based on comments in the mailing 
list. The patch was rejected because it's expected that anyone implementing a 
file input format should make sure the files are loaded in order. They can do 
that by overriding the {{listStatus}} method as follows:

{code}
  import scala.collection.JavaConverters._

  // Sort the file pieces, since order matters.
  override def listStatus(job: JobContext): java.util.List[FileStatus] = {
    // super.listStatus returns a java.util.List, so convert to a Scala
    // collection, sort by path, and convert back.
    val listing = super.listStatus(job).asScala
    val sortedListing = listing.sortWith { (lhs, rhs) =>
      lhs.getPath().compareTo(rhs.getPath()) < 0
    }
    sortedListing.toList.asJava
  }
{code}
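
For example (the class and path names below are illustrative, not from the 
rejected patch), the override can live in a custom input format and be used 
from the shell via {{newAPIHadoopFile}}:

{code}
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import scala.collection.JavaConverters._

// Hypothetical input format that returns file parts in path order.
class SortedTextInputFormat extends TextInputFormat {
  override def listStatus(job: JobContext): java.util.List[FileStatus] = {
    super.listStatus(job).asScala
      .sortWith((lhs, rhs) => lhs.getPath().compareTo(rhs.getPath()) < 0)
      .toList.asJava
  }
}

// Load a previously sorted dataset with a deterministic partition order.
val lines = sc.newAPIHadoopFile[LongWritable, Text, SortedTextInputFormat]("/data/sorted")
  .map { case (_, text) => text.toString }
{code}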

> Spark loads file partitions in inconsistent order on native filesystems
> ---
>
> Key: SPARK-5300
> URL: https://issues.apache.org/jira/browse/SPARK-5300
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux, EXT4, for example.
>Reporter: Ewan Higgs
>
> Discussed on user list in April 2014:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
> And on dev list January 2015:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
> When using a file system which isn't HDFS, file partitions ('part-0, 
> part-1', etc.) are not guaranteed to load in the same order. This means 
> previously sorted RDDs will be loaded out of order. 






[jira] [Commented] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems

2015-01-27 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293201#comment-14293201
 ] 

Ewan Higgs commented on SPARK-5300:
---

The PR appears to have been rejected on the grounds that all FileInputFormats 
that want sorting should sort the file parts. Perhaps this could be documented 
more clearly.

> Spark loads file partitions in inconsistent order on native filesystems
> ---
>
> Key: SPARK-5300
> URL: https://issues.apache.org/jira/browse/SPARK-5300
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux, EXT4, for example.
>Reporter: Ewan Higgs
>
> Discussed on user list in April 2014:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
> And on dev list January 2015:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
> When using a file system which isn't HDFS, file partitions ('part-0, 
> part-1', etc.) are not guaranteed to load in the same order. This means 
> previously sorted RDDs will be loaded out of order. 






[jira] [Created] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems

2015-01-17 Thread Ewan Higgs (JIRA)
Ewan Higgs created SPARK-5300:
-

 Summary: Spark loads file partitions in inconsistent order on 
native filesystems
 Key: SPARK-5300
 URL: https://issues.apache.org/jira/browse/SPARK-5300
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.2.0, 1.1.0
 Environment: Linux, EXT4, for example.
Reporter: Ewan Higgs


Discussed on user list in April 2014:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html

And on dev list January 2015:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html

When using a file system which isn't HDFS, file partitions ('part-0, 
part-1', etc.) are not guaranteed to load in the same order. This means 
previously sorted RDDs will be loaded out of order. 


