[jira] [Commented] (SPARK-16863) ProbabilisticClassifier.fit check threshoulds' length

2016-08-02 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405333#comment-15405333
 ] 

zhengruifeng commented on SPARK-16863:
--

[SPARK-16851] adds checking for 
{{ProbabilisticClassificationModel.setThresholds}}.

I just found that {{ProbabilisticClassifier.fit}} also needs to check the 
thresholds' length if {{ProbabilisticClassifier.setThresholds}} is called, so 
I opened this JIRA.
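
For illustration, a minimal sketch of the kind of length check being proposed (a hypothetical helper, not Spark's actual {{fit()}} code):

{code}
// Hypothetical helper, not Spark's actual code: the kind of eager check fit() could do
// so that a mismatched thresholds array fails at training time rather than at transform().
def validateThresholds(numClasses: Int, thresholds: Option[Array[Double]]): Unit = {
  thresholds.foreach { t =>
    require(t.length == numClasses,
      s"thresholds has length ${t.length}, but the training data has $numClasses classes")
  }
}

// validateThresholds(3, Some(Array(0.1, 0.2, 0.3, 0.4, 0.5)))  // would fail fast in fit()
{code}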

> ProbabilisticClassifier.fit check threshoulds' length
> -
>
> Key: SPARK-16863
> URL: https://issues.apache.org/jira/browse/SPARK-16863
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> val path = 
> "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rf = new RandomForestClassifier
> rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5))
> val rfm = rf.fit(data)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees
> rfm.numClasses
> res2: Int = 3
> rfm.getThresholds
> res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5)
> rfm.transform(data)
> java.lang.IllegalArgumentException: requirement failed: 
> RandomForestClassificationModel.transform() called with non-matching 
> numClasses and thresholds.length. numClasses=3, but thresholds has length 5
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101)
>   ... 72 elided
> {code}
> {{ProbabilisticClassifier.fit()}} should throw an exception if its 
> thresholds are set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7146) Should ML sharedParams be a public API?

2016-08-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300
 ] 

Nicholas Chammas edited comment on SPARK-7146 at 8/3/16 4:45 AM:
-

A quick update from a PySpark user: I am using HasInputCol, HasInputCols, 
HasLabelCol, and HasOutputCol to create custom transformers, and I find them 
very handy.

I know Python does not have a notion of "private" classes, but knowing these 
are part of the public API would be good.

In summary: The updated proposal looks good to me, with the caveat that I only 
just started learning the new ML Pipeline API.


was (Author: nchammas):
A quick update from a PySpark user: I am using HasInputCol, HasInputCols, 
HasLabelCol, and HasOutputCol to create custom transformers, and I find them 
very handy.

I know Python does not have a notion of "private" classes, but knowing these 
are part of the public API would be good.

I summary: The updated proposal looks good to me, with the caveat that I only 
just started learning the new ML Pipeline API.
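
As a self-contained illustration of the shared-param-trait pattern under discussion (the names below only mirror Spark's HasInputCol/HasOutputCol; this is not the {{org.apache.spark.ml.param.shared}} code):

{code}
// Toy sketch: traits that carry a "shared parameter" and are mixed into a transformer.
trait HasInputCol {
  private var _inputCol: String = "input"
  def getInputCol: String = _inputCol
  def setInputCol(value: String): this.type = { _inputCol = value; this }
}

trait HasOutputCol {
  private var _outputCol: String = "output"
  def getOutputCol: String = _outputCol
  def setOutputCol(value: String): this.type = { _outputCol = value; this }
}

// A trivial "transformer" that mixes in both traits and operates on a map-based row,
// standing in for a DataFrame column operation.
class UppercaseTransformer extends HasInputCol with HasOutputCol {
  def transform(row: Map[String, String]): Map[String, String] =
    row + (getOutputCol -> row(getInputCol).toUpperCase)
}

object SharedParamsDemo extends App {
  val t = new UppercaseTransformer().setInputCol("text").setOutputCol("textUpper")
  println(t.transform(Map("text" -> "hello")))  // Map(text -> hello, textUpper -> HELLO)
}
{code}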

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2016-08-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300
 ] 

Nicholas Chammas commented on SPARK-7146:
-

A quick update from a PySpark user: I am using HasInputCol, HasInputCols, 
HasLabelCol, and HasOutputCol to create custom transformers, and I find them 
very handy.

I know Python does not have a notion of "private" classes, but knowing these 
are part of the public API would be good.

In summary: The updated proposal looks good to me, with the caveat that I only 
just started learning the new ML Pipeline API.

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16863) ProbabilisticClassifier.fit check threshoulds' length

2016-08-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405276#comment-15405276
 ] 

Sean Owen commented on SPARK-16863:
---

Wait, how is this different from SPARK-16851? I don't see why these were opened 
as different JIRAs.

> ProbabilisticClassifier.fit check threshoulds' length
> -
>
> Key: SPARK-16863
> URL: https://issues.apache.org/jira/browse/SPARK-16863
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> val path = 
> "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rf = new RandomForestClassifier
> rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5))
> val rfm = rf.fit(data)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees
> rfm.numClasses
> res2: Int = 3
> rfm.getThresholds
> res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5)
> rfm.transform(data)
> java.lang.IllegalArgumentException: requirement failed: 
> RandomForestClassificationModel.transform() called with non-matching 
> numClasses and thresholds.length. numClasses=3, but thresholds has length 5
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101)
>   ... 72 elided
> {code}
> {{ProbabilisticClassifier.fit()}} should throw an exception if its 
> thresholds are set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16862:


Assignee: (was: Apache Spark)

> Configurable buffer size in `UnsafeSorterSpillReader`
> -
>
> Key: SPARK-16862
> URL: https://issues.apache.org/jira/browse/SPARK-16862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Minor
>
> The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8 KB 
> buffer to read data off disk. This could be made configurable to improve 
> disk read performance.
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53
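
For context, a hedged sketch of what "configurable" could look like here; the config key below is made up for illustration:

{code}
// Sketch only: take the read-buffer size from configuration instead of relying on
// BufferedInputStream's 8 KB default. "spark.unsafe.sorter.spill.reader.buffer.size"
// is an invented key, not an actual Spark configuration.
import java.io.{BufferedInputStream, FileInputStream, InputStream}

def openSpillFile(path: String, conf: Map[String, String]): InputStream = {
  val bufferSizeBytes =
    conf.getOrElse("spark.unsafe.sorter.spill.reader.buffer.size", (8 * 1024).toString).toInt
  new BufferedInputStream(new FileInputStream(path), bufferSizeBytes)
}
{code}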



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405245#comment-15405245
 ] 

Apache Spark commented on SPARK-16862:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/14475

> Configurable buffer size in `UnsafeSorterSpillReader`
> -
>
> Key: SPARK-16862
> URL: https://issues.apache.org/jira/browse/SPARK-16862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Minor
>
> The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8 KB 
> buffer to read data off disk. This could be made configurable to improve 
> disk read performance.
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16862:


Assignee: Apache Spark

> Configurable buffer size in `UnsafeSorterSpillReader`
> -
>
> Key: SPARK-16862
> URL: https://issues.apache.org/jira/browse/SPARK-16862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
>
> The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8 KB 
> buffer to read data off disk. This could be made configurable to improve 
> disk read performance.
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16853) Analysis error for DataSet typed selection

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16853:


Assignee: (was: Apache Spark)

> Analysis error for DataSet typed selection
> --
>
> Key: SPARK-16853
> URL: https://issues.apache.org/jira/browse/SPARK-16853
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>
> For DataSet typed selection
> {code}
> def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
> {code}
> If U1 contains sub-fields, it reports an AnalysisException.
> Reproducer:
> {code}
> scala> case class A(a: Int, b: Int)
> scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
> org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
> columns: [_2];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> ...
> {code}
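
A possible workaround sketch while this is being fixed (not verified against 2.0.0-rc5; assumes a spark-shell session so {{spark}} and its implicits are available): go through the typed {{map}} API instead of {{select($"_2".as[A])}}, so the nested fields are never re-resolved by name.

{code}
// Workaround sketch, unverified against 2.0.0-rc5.
import spark.implicits._
case class A(a: Int, b: Int)
val ds = Seq((0, A(1, 2))).toDS()
val justA = ds.map(_._2)   // Dataset[A], extracted via the typed API
justA.show()
{code}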



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16853) Analysis error for DataSet typed selection

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405220#comment-15405220
 ] 

Apache Spark commented on SPARK-16853:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14474

> Analysis error for DataSet typed selection
> --
>
> Key: SPARK-16853
> URL: https://issues.apache.org/jira/browse/SPARK-16853
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>
> For DataSet typed selection
> {code}
> def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
> {code}
> If U1 contains sub-fields, it reports an AnalysisException.
> Reproducer:
> {code}
> scala> case class A(a: Int, b: Int)
> scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
> org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
> columns: [_2];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16853) Analysis error for DataSet typed selection

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16853:


Assignee: Apache Spark

> Analysis error for DataSet typed selection
> --
>
> Key: SPARK-16853
> URL: https://issues.apache.org/jira/browse/SPARK-16853
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Apache Spark
>
> For DataSet typed selection
> {code}
> def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
> {code}
> If U1 contains sub-fields, it reports an AnalysisException.
> Reproducer:
> {code}
> scala> case class A(a: Int, b: Int)
> scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
> org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
> columns: [_2];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16495) Add ADMM optimizer in mllib package

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405217#comment-15405217
 ] 

Apache Spark commented on SPARK-16495:
--

User 'ZunwenYou' has created a pull request for this issue:
https://github.com/apache/spark/pull/14473

> Add ADMM optimizer in mllib package
> ---
>
> Key: SPARK-16495
> URL: https://issues.apache.org/jira/browse/SPARK-16495
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: zunwen you
>
>  Alternating Direction Method of Multipliers (ADMM) is well suited to 
> distributed convex optimization, and in particular to large-scale problems 
> arising in statistics, machine learning, and related areas.
> Details can be found in [S. Boyd's 
> paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html).
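
For a concrete feel of the iteration described in Boyd's paper, a toy one-dimensional lasso solved with scaled-form ADMM (illustrative only; this is not the proposed MLlib optimizer):

{code}
// Solve minimize (1/2)*(x - a)^2 + lambda*|x| by splitting x = z and iterating the
// scaled-form ADMM updates.
def softThreshold(v: Double, k: Double): Double =
  math.signum(v) * math.max(math.abs(v) - k, 0.0)

def admmLasso1D(a: Double, lambda: Double, rho: Double = 1.0, iters: Int = 100): Double = {
  var x = 0.0; var z = 0.0; var u = 0.0
  for (_ <- 0 until iters) {
    x = (a + rho * (z - u)) / (1.0 + rho)   // x-update: closed-form quadratic minimization
    z = softThreshold(x + u, lambda / rho)  // z-update: proximal step for lambda * |z|
    u = u + x - z                           // scaled dual update
  }
  z
}

// admmLasso1D(3.0, 1.0) converges to 2.0, the soft-thresholded solution of the problem.
{code}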



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16866:


Assignee: Apache Spark

> Basic infrastructure for file-based SQL end-to-end tests
> 
>
> Key: SPARK-16866
> URL: https://issues.apache.org/jira/browse/SPARK-16866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405203#comment-15405203
 ] 

Apache Spark commented on SPARK-16866:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/14472

> Basic infrastructure for file-based SQL end-to-end tests
> 
>
> Key: SPARK-16866
> URL: https://issues.apache.org/jira/browse/SPARK-16866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16866:


Assignee: (was: Apache Spark)

> Basic infrastructure for file-based SQL end-to-end tests
> 
>
> Key: SPARK-16866
> URL: https://issues.apache.org/jira/browse/SPARK-16866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests

2016-08-02 Thread Peter Lee (JIRA)
Peter Lee created SPARK-16866:
-

 Summary: Basic infrastructure for file-based SQL end-to-end tests
 Key: SPARK-16866
 URL: https://issues.apache.org/jira/browse/SPARK-16866
 Project: Spark
  Issue Type: Sub-task
Reporter: Peter Lee






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16865) A file-based end-to-end SQL query suite

2016-08-02 Thread Peter Lee (JIRA)
Peter Lee created SPARK-16865:
-

 Summary: A file-based end-to-end SQL query suite
 Key: SPARK-16865
 URL: https://issues.apache.org/jira/browse/SPARK-16865
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Peter Lee


Spark currently has a large number of end-to-end SQL test cases in various 
SQLQuerySuites. It is fairly difficult to manage and operate these end-to-end 
test cases. This ticket proposes that we use files to specify queries and 
expected outputs.
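
A hedged sketch of the general idea; the on-disk layout below (queries separated by ";", expected results separated by blank lines) is invented for illustration and is not the format this ticket ends up with:

{code}
// File-driven SQL test cases: pair each query in a query file with an expected result.
import scala.io.Source

case class SqlTestCase(sql: String, expectedOutput: String)

def loadTestCases(queryFile: String, expectedFile: String): Seq[SqlTestCase] = {
  val queries  = Source.fromFile(queryFile).mkString.split(";").map(_.trim).filter(_.nonEmpty)
  val expected = Source.fromFile(expectedFile).mkString.split("\n\n").map(_.trim)
  queries.zip(expected).map { case (q, e) => SqlTestCase(q, e) }.toSeq
}

// A test harness would then run spark.sql(c.sql) for each case and compare the
// rendered output against c.expectedOutput.
{code}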




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14387) Exceptions thrown when querying ORC tables

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405192#comment-15405192
 ] 

Apache Spark commented on SPARK-14387:
--

User 'rajeshbalamohan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14471

> Exceptions thrown when querying ORC tables
> --
>
> Key: SPARK-14387
> URL: https://issues.apache.org/jira/browse/SPARK-14387
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>
> On the master branch, I tried to run TPC-DS queries (e.g. Query27) at 200 GB 
> scale. Initially I got the following exception (as FileScanRDD has been made 
> the default on the master branch):
> {noformat}
> 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0. 
> java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
> {noformat}
> When running with "spark.sql.sources.fileScan=false", the following exception is 
> thrown:
> {noformat}
> 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING,
> java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
> at 
> 

[jira] [Created] (SPARK-16864) Comprehensive version info

2016-08-02 Thread jay vyas (JIRA)
jay vyas created SPARK-16864:


 Summary: Comprehensive version info 
 Key: SPARK-16864
 URL: https://issues.apache.org/jira/browse/SPARK-16864
 Project: Spark
  Issue Type: Improvement
Reporter: jay vyas


Spark versions can be grepped out of the Spark banner that comes up on startup, 
but otherwise there is no programmatic, reliable way to get version information.

There is also no git commit id, etc., so precise version checking isn't possible.
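
A hypothetical sketch of the kind of programmatic lookup being asked for, assuming build metadata were baked into a properties file at build time (the resource name and property keys below are made up):

{code}
// Read build metadata (version, git revision, build date) from a resource bundled in the jar.
import java.util.Properties

def buildInfo(): Properties = {
  val props = new Properties()
  val in = getClass.getResourceAsStream("/spark-build-info.properties")  // hypothetical resource
  if (in != null) {
    try props.load(in) finally in.close()
  }
  props
}

// buildInfo().getProperty("revision")  // e.g. a git SHA, if the build recorded one
{code}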



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16863) ProbabilisticClassifier.fit check threshoulds' length

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16863:


Assignee: Apache Spark

> ProbabilisticClassifier.fit check threshoulds' length
> -
>
> Key: SPARK-16863
> URL: https://issues.apache.org/jira/browse/SPARK-16863
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> val path = 
> "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rf = new RandomForestClassifier
> rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5))
> val rfm = rf.fit(data)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees
> rfm.numClasses
> res2: Int = 3
> rfm.getThresholds
> res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5)
> rfm.transform(data)
> java.lang.IllegalArgumentException: requirement failed: 
> RandomForestClassificationModel.transform() called with non-matching 
> numClasses and thresholds.length. numClasses=3, but thresholds has length 5
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101)
>   ... 72 elided
> {code}
> {{ProbabilisticClassifier.fit()}} should throw an exception if its 
> thresholds are set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16863) ProbabilisticClassifier.fit check threshoulds' length

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405182#comment-15405182
 ] 

Apache Spark commented on SPARK-16863:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/14470

> ProbabilisticClassifier.fit check threshoulds' length
> -
>
> Key: SPARK-16863
> URL: https://issues.apache.org/jira/browse/SPARK-16863
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> val path = 
> "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rf = new RandomForestClassifier
> rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5))
> val rfm = rf.fit(data)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees
> rfm.numClasses
> res2: Int = 3
> rfm.getThresholds
> res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5)
> rfm.transform(data)
> java.lang.IllegalArgumentException: requirement failed: 
> RandomForestClassificationModel.transform() called with non-matching 
> numClasses and thresholds.length. numClasses=3, but thresholds has length 5
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101)
>   ... 72 elided
> {code}
> {{ProbabilisticClassifier.fit()}} should throw an exception if its 
> thresholds are set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16863) ProbabilisticClassifier.fit check threshoulds' length

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16863:


Assignee: (was: Apache Spark)

> ProbabilisticClassifier.fit check threshoulds' length
> -
>
> Key: SPARK-16863
> URL: https://issues.apache.org/jira/browse/SPARK-16863
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> {code}
> val path = 
> "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rf = new RandomForestClassifier
> rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5))
> val rfm = rf.fit(data)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees
> rfm.numClasses
> res2: Int = 3
> rfm.getThresholds
> res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5)
> rfm.transform(data)
> java.lang.IllegalArgumentException: requirement failed: 
> RandomForestClassificationModel.transform() called with non-matching 
> numClasses and thresholds.length. numClasses=3, but thresholds has length 5
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101)
>   ... 72 elided
> {code}
> {{ProbabilisticClassifier.fit()}} should throw an exception if its 
> thresholds are set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16863) ProbabilisticClassifier.fit check threshoulds' length

2016-08-02 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-16863:


 Summary: ProbabilisticClassifier.fit check threshoulds' length
 Key: SPARK-16863
 URL: https://issues.apache.org/jira/browse/SPARK-16863
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Minor


{code}
val path = 
"./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt"
val data = spark.read.format("libsvm").load(path)
val rf = new RandomForestClassifier
rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5))

val rfm = rf.fit(data)
rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees

rfm.numClasses
res2: Int = 3

rfm.getThresholds
res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5)

rfm.transform(data)
java.lang.IllegalArgumentException: requirement failed: 
RandomForestClassificationModel.transform() called with non-matching numClasses 
and thresholds.length. numClasses=3, but thresholds has length 5
  at scala.Predef$.require(Predef.scala:224)
  at 
org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101)
  ... 72 elided
{code}

{{ProbabilisticClassifier.fit()}} should throw an exception if its 
thresholds are set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-02 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-16862:
---

 Summary: Configurable buffer size in `UnsafeSorterSpillReader`
 Key: SPARK-16862
 URL: https://issues.apache.org/jira/browse/SPARK-16862
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Tejas Patil
Priority: Minor


The `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8 KB 
buffer to read data off disk. This could be made configurable to improve 
disk read performance.

https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable

2016-08-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405112#comment-15405112
 ] 

Saisai Shao commented on SPARK-14453:
-

If you want to fix this issue, it would be better target to SPARK-12344.

> Remove SPARK_JAVA_OPTS environment variable
> ---
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS was deprecated since 1.0, with the release of major version 
> (2.0), I think it would be better to remove the support of this env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405071#comment-15405071
 ] 

Hyukjin Kwon commented on SPARK-16610:
--

One thought: if that is the case, we might have to officially document that we 
no longer respect the Hadoop configuration.

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used 
> to set the codec, and its value is also set to {{orc.compress}} (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the writer 
> options, we should not use the default value of "compression" (snappy) as the 
> codec. Instead, we should respect the value of {{orc.compress}}.
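
An illustrative sketch of the precedence being argued for (the option and conf names come from the report; the resolution logic is a sketch, not Spark's implementation):

{code}
// Explicit writer option wins; otherwise respect a user-set orc.compress;
// only fall back to the snappy default when neither was provided.
def resolveOrcCompression(writerOptions: Map[String, String]): String =
  writerOptions.get("compression")
    .orElse(writerOptions.get("orc.compress"))
    .getOrElse("SNAPPY")

// resolveOrcCompression(Map("orc.compress" -> "ZLIB"))  // "ZLIB", not the snappy default
{code}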



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405069#comment-15405069
 ] 

Xusen Yin commented on SPARK-16857:
---

I agree the cluster assignments could be arbitrary. Yes, under this condition we 
shouldn't use MulticlassClassificationEvaluator to evaluate the result.

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws an IllegalArgumentException as below, since the 
> KMeans algorithm outputs an Integer in the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross-validator. As the stack trace 
> above indicates, it is failing at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405057#comment-15405057
 ] 

Sean Zhong commented on SPARK-16320:


[~maver1ck] Did you use the test case in this jira
{code}
select count(*) where nested_column.id > some_id
{code}

Or the test case in SPARK-16321?
{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 else 
[]).collect()
{code}

There is a slight difference. In the first test case, the filter condition uses 
nested fields, so filter push-down is not supported. Your PR 14465 should 
therefore have no effect there. 

Basically SPARK-16320 and SPARK-16321 are different problems.




> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30 GB in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405055#comment-15405055
 ] 

Sean Owen commented on SPARK-16857:
---

I don't think that in general it makes sense to use 
MulticlassClassificationEvaluator with k-means, since there is in general no 
known correct 'label' (cluster assignment). If you do happen to have these 
assignments, yes you can do something like the above, but I might expect you do 
have to do a little extra work like convert the integer cluster assignment into 
a double. That's arguably as it should be, I think.

(Keep in mind that cluster numbering is arbitrary; are you sure this will be 
valid to compare against a fixed set of cluster assignments?)

There are other k-means evaluation metrics, but they aren't evaluations on the 
cluster assignment itself, but things like intra-cluster distance.
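
A minimal sketch of the "convert the integer cluster assignment into a double" step, assuming an existing DataFrame {{predictions}} produced by {{KMeansModel.transform()}} with the default prediction column name:

{code}
// Cast the integer prediction column to double so the evaluator's DoubleType check passes.
import org.apache.spark.sql.functions.col

val withDoublePrediction =
  predictions.withColumn("prediction", col("prediction").cast("double"))

// evaluator.evaluate(withDoublePrediction) would then see a DoubleType column.
{code}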

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws an IllegalArgumentException as below, since the 
> KMeans algorithm outputs an Integer in the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross-validator. As the stack trace 
> above indicates, it is failing at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405050#comment-15405050
 ] 

Xusen Yin commented on SPARK-16857:
---

Using CrossValidator with KMeans should be supported. As a kind of external 
evaluation for KMeans, I think using MulticlassClassificationEvaluator with 
KMeans should also be supported. Why not send a PR, since it would be a quick 
fix?

CC [~yanboliang]

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws an IllegalArgumentException as below, since the 
> KMeans algorithm outputs an Integer in the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross-validator. As the stack trace 
> above indicates, it is failing at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405044#comment-15405044
 ] 

Apache Spark commented on SPARK-16700:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14469

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5.
> StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16700:


Assignee: (was: Apache Spark)

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5.
> StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type <type 'dict'>
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405043#comment-15405043
 ] 

Davies Liu commented on SPARK-16700:


Sent PR https://github.com/apache/spark/pull/14469 to address these, could you 
help to test and review them?

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python dict type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type <type 'dict'>
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16700:


Assignee: Apache Spark

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Apache Spark
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python dict type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type <type 'dict'>
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-16700:
--

Assignee: Davies Liu

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python dict type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type <type 'dict'>
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15541) SparkContext.stop throws error

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15541:
--
Fix Version/s: 2.1.0
   2.0.1
   1.6.3

> SparkContext.stop throws error
> --
>
> Key: SPARK-15541
> URL: https://issues.apache.org/jira/browse/SPARK-15541
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>Assignee: Maciej Bryński
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> When running unit tests or examples from the command line or IntelliJ, 
> SparkContext throws errors.
> For example:
> ./bin/run-example ml.JavaNaiveBayesExample
> Exception in thread "main" 16/05/25 15:17:55 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> java.lang.NoSuchMethodError: 
> java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>   at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:176)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:291)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:269)
>   at org.apache.spark.SparkEnv.stop(SparkEnv.scala:91)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1796)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1795)
>   at org.apache.spark.sql.SparkSession.stop(SparkSession.scala:577)
>   at 
> org.apache.spark.examples.ml.JavaNaiveBayesExample.main(JavaNaiveBayesExample.java:61)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 16/05/25 15:17:55 INFO ShutdownHookManager: Shutdown hook called



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16671) Merge variable substitution code in core and SQL

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16671:


Assignee: (was: Apache Spark)

> Merge variable substitution code in core and SQL
> 
>
> Key: SPARK-16671
> URL: https://issues.apache.org/jira/browse/SPARK-16671
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> SPARK-16272 added support for variable substitution in configs in the core 
> Spark configuration. That code has a lot of similarities with SQL's 
> {{VariableSubstitution}}, and we should make both use the same code as much 
> as possible.
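
For readers who have not looked at either code path, here is a standalone sketch of the kind of ${key} substitution both implementations perform. It is not Spark's actual code, and it omits the prefixed references (environment variables, system properties, Hive variables) that the real classes handle:

{code}
import re

_REF = re.compile(r"\$\{([^}]+)\}")

def substitute(value, conf, depth=0, max_depth=10):
    """Recursively expand ${key} references found in a config value.

    Illustrative sketch only; not Spark's implementation.
    """
    if depth > max_depth:
        raise ValueError("too many nested substitutions")

    def repl(match):
        key = match.group(1)
        if key not in conf:
            return match.group(0)  # leave unresolved references untouched
        return substitute(conf[key], conf, depth + 1, max_depth)

    return _REF.sub(repl, value)

conf = {"spark.driver.host": "driver-1",
        "spark.app.name": "job-on-${spark.driver.host}"}
print(substitute(conf["spark.app.name"], conf))  # job-on-driver-1
{code}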



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16671) Merge variable substitution code in core and SQL

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16671:


Assignee: Apache Spark

> Merge variable substitution code in core and SQL
> 
>
> Key: SPARK-16671
> URL: https://issues.apache.org/jira/browse/SPARK-16671
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-16272 added support for variable substitution in configs in the core 
> Spark configuration. That code has a lot of similarities with SQL's 
> {{VariableSubstitution}}, and we should make both use the same code as much 
> as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16671) Merge variable substitution code in core and SQL

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405022#comment-15405022
 ] 

Apache Spark commented on SPARK-16671:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14468

> Merge variable substitution code in core and SQL
> 
>
> Key: SPARK-16671
> URL: https://issues.apache.org/jira/browse/SPARK-16671
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> SPARK-16272 added support for variable substitution in configs in the core 
> Spark configuration. That code has a lot of similarities with SQL's 
> {{VariableSubstitution}}, and we should make both use the same code as much 
> as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16796) Visible passwords on Spark environment page

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16796.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14409
[https://github.com/apache/spark/pull/14409]

> Visible passwords on Spark environment page
> ---
>
> Key: SPARK-16796
> URL: https://issues.apache.org/jira/browse/SPARK-16796
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Artur
>Assignee: Artur
>Priority: Trivial
> Fix For: 2.1.0
>
> Attachments: 
> Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch
>
>
> The password properties 
> spark.ssl.keyPassword, spark.ssl.keyStorePassword and spark.ssl.trustStorePassword
> are visible in the Web UI on the Environment page.
> Can we mask them in the Web UI?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16861:


Assignee: (was: Apache Spark)

> Refactor PySpark accumulator API to be on top of AccumulatorV2 API
> --
>
> Key: SPARK-16861
> URL: https://issues.apache.org/jira/browse/SPARK-16861
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: holdenk
>
> The PySpark accumulator API is implemented on top of the legacy accumulator 
> API which is now deprecated. To allow this deprecated API to be removed 
> eventually we will need to rewrite the PySpark accumulators on top of the new 
> API. An open question for a possible follow up is if we want to make the 
> Python accumulator API look more like the new accumulator API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404891#comment-15404891
 ] 

Apache Spark commented on SPARK-16861:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/14467

> Refactor PySpark accumulator API to be on top of AccumulatorV2 API
> --
>
> Key: SPARK-16861
> URL: https://issues.apache.org/jira/browse/SPARK-16861
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: holdenk
>
> The PySpark accumulator API is implemented on top of the legacy accumulator 
> API which is now deprecated. To allow this deprecated API to be removed 
> eventually we will need to rewrite the PySpark accumulators on top of the new 
> API. An open question for a possible follow up is if we want to make the 
> Python accumulator API look more like the new accumulator API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16861:


Assignee: Apache Spark

> Refactor PySpark accumulator API to be on top of AccumulatorV2 API
> --
>
> Key: SPARK-16861
> URL: https://issues.apache.org/jira/browse/SPARK-16861
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: holdenk
>Assignee: Apache Spark
>
> The PySpark accumulator API is implemented on top of the legacy accumulator 
> API which is now deprecated. To allow this deprecated API to be removed 
> eventually we will need to rewrite the PySpark accumulators on top of the new 
> API. An open question for a possible follow up is if we want to make the 
> Python accumulator API look more like the new accumulator API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API

2016-08-02 Thread holdenk (JIRA)
holdenk created SPARK-16861:
---

 Summary: Refactor PySpark accumulator API to be on top of 
AccumulatorV2 API
 Key: SPARK-16861
 URL: https://issues.apache.org/jira/browse/SPARK-16861
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Reporter: holdenk


The PySpark accumulator API is implemented on top of the legacy accumulator API 
which is now deprecated. To allow this deprecated API to be removed eventually 
we will need to rewrite the PySpark accumulators on top of the new API. An open 
question for a possible follow up is if we want to make the Python accumulator 
API look more like the new accumulator API.
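
For context, the user-facing API that needs to keep working looks roughly like this (a sketch using the documented {{AccumulatorParam}} interface; {{ListParam}} is a hypothetical custom accumulator):

{code}
# Sketch of today's PySpark accumulator usage; the refactor should preserve
# this behaviour while moving the plumbing onto AccumulatorV2.
from pyspark import SparkContext, AccumulatorParam

class ListParam(AccumulatorParam):
    # Hypothetical custom accumulator that collects values into a list.
    def zero(self, value):
        return []
    def addInPlace(self, v1, v2):
        v1.extend(v2)
        return v1

sc = SparkContext(appName="accumulator-demo")

counter = sc.accumulator(0)             # built-in numeric accumulator
seen = sc.accumulator([], ListParam())  # custom accumulator

rdd = sc.parallelize(range(10))
rdd.foreach(lambda x: counter.add(1))
rdd.foreach(lambda x: seen.add([x]))

print(counter.value)  # 10
print(seen.value)     # the ten elements, in arbitrary order
{code}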



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16858) Removal of TestHiveSharedState

2016-08-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16858.
-
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.1.0

> Removal of TestHiveSharedState
> --
>
> Key: SPARK-16858
> URL: https://issues.apache.org/jira/browse/SPARK-16858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Remove TestHiveSharedState. Otherwise, we are not really testing the 
> reflection logic based on the setting of CATALOG_IMPLEMENTATION.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404676#comment-15404676
 ] 

Apache Spark commented on SPARK-16321:
--

User 'maver1ck' has created a pull request for this issue:
https://github.com/apache/spark/pull/14465

> Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
> -
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume that problem results from the performance problem with reading 
> parquet files
> *Original Issue description*
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16321:


Assignee: (was: Apache Spark)

> Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
> -
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume that problem results from the performance problem with reading 
> parquet files
> *Original Issue description*
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16321:


Assignee: Apache Spark

> Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
> -
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>Priority: Critical
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume that problem results from the performance problem with reading 
> parquet files
> *Original Issue description*
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404675#comment-15404675
 ] 

Apache Spark commented on SPARK-16320:
--

User 'maver1ck' has created a pull request for this issue:
https://github.com/apache/spark/pull/14465

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file

2016-08-02 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404654#comment-15404654
 ] 

Maciej Bryński commented on SPARK-16321:


[~smilegator]
spark.sql.parquet.filterPushdown already defaults to true.
The vectorized reader isn't a factor here because I have nested columns (and 
the vectorized reader only supports atomic types).

> Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
> -
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume that problem results from the performance problem with reading 
> parquet files
> *Original Issue description*
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16860) UDT Stringification Incorrect in PySpark

2016-08-02 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16860:
-

 Summary: UDT Stringification Incorrect in PySpark
 Key: SPARK-16860
 URL: https://issues.apache.org/jira/browse/SPARK-16860
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Vladimir Feinberg
Priority: Minor


When using `show()` on a `DataFrame` containing a UDT, Spark doesn't call the 
appropriate `__str__` method for display.

Example: https://gist.github.com/vlad17/baa8e18ed724c4d88436a92ca159dd5b



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file

2016-08-02 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404652#comment-15404652
 ] 

Xiao Li commented on SPARK-16321:
-

Can you set  `spark.sql.parquet.enableVectorizedReader` to false and 
`spark.sql.parquet.filterPushdown` to true?

BTW, keep the other settings unchanged.
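
In PySpark that comparison can be run roughly as follows (a sketch reusing {{sqlctx}}, {{path}} and the placeholder {{some_id}} from the repro script in the description):

{code}
# Toggle only the two Parquet options mentioned above, then re-run the
# query from the issue description. 'path' and 'some_id' are placeholders
# taken from the original repro.
sqlctx.setConf("spark.sql.parquet.enableVectorizedReader", "false")
sqlctx.setConf("spark.sql.parquet.filterPushdown", "true")

df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(
    lambda r: [r.id] if not r.id % 10 else []).collect()
{code}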

> Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
> -
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, 
> spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, 
> spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, 
> visualvm_spark2_G1GC.png
>
>
> *UPDATE*
> Please start with this comment 
> https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785
> I assume that problem results from the performance problem with reading 
> parquet files
> *Original Issue description*
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404646#comment-15404646
 ] 

Maciej Bryński edited comment on SPARK-16320 at 8/2/16 7:38 PM:


[~michael], [~yhuai], [~rxin]
I think this is smallest change that resolves my problem.
{code}
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index f1c78bb..2b84885 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat
 None
   }

+pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _))
+
 val broadcastedHadoopConf =
   sparkSession.sparkContext.broadcast(new 
SerializableConfiguration(hadoopConf))

{code}

Can anybody tell why it works ?


was (Author: maver1ck):
[~michael], [~yhuai]
I think this is smallest change that resolves my problem.
{code}
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index f1c78bb..2b84885 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat
 None
   }

+pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _))
+
 val broadcastedHadoopConf =
   sparkSession.sparkContext.broadcast(new 
SerializableConfiguration(hadoopConf))

{code}

Can anybody tell why it works ?

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> 

[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404646#comment-15404646
 ] 

Maciej Bryński edited comment on SPARK-16320 at 8/2/16 7:37 PM:


[~michael], [~yhuai]
I think this is smallest change that resolves my problem.
{code}
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index f1c78bb..2b84885 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat
 None
   }

+pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _))
+
 val broadcastedHadoopConf =
   sparkSession.sparkContext.broadcast(new 
SerializableConfiguration(hadoopConf))

{code}

Can anybody tell why it works ?


was (Author: maver1ck):
[~michael], [~yhuai]
I think this is smallest change that resolves my problem.
{code}
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index f1c78bb..2b84885 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat
 None
   }

+pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _))
+
 val broadcastedHadoopConf =
   sparkSession.sparkContext.broadcast(new 
SerializableConfiguration(hadoopConf))

{code}

Is anybody can tell why it works ?

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> 

[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404646#comment-15404646
 ] 

Maciej Bryński commented on SPARK-16320:


[~michael], [~yhuai]
I think this is smallest change that resolves my problem.
{code}
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index f1c78bb..2b84885 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat
 None
   }

+pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _))
+
 val broadcastedHadoopConf =
   sparkSession.sparkContext.broadcast(new 
SerializableConfiguration(hadoopConf))

{code}

Can anybody tell why it works?

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404620#comment-15404620
 ] 

Maciej Bryński commented on SPARK-16320:


I think that problem is already resolved by 
https://github.com/apache/spark/pull/13701

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404619#comment-15404619
 ] 

Maciej Bryński commented on SPARK-16320:


Yes.
That's it.
With this PR Spark 2.0 is faster than 1.6.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404611#comment-15404611
 ] 

Apache Spark commented on SPARK-16802:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14464

> joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
> 
>
> Key: SPARK-16802
> URL: https://issues.apache.org/jira/browse/SPARK-16802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
>Priority: Critical
>
> Hello!
> This is a little similar to 
> [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I 
> have reopened it?).
> I would recommend to give another full review to {{HashedRelation.scala}}, 
> particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors 
> that I haven't managed to reproduce so far, as well as what I suspect could 
> be memory leaks (I have a query in a loop OOMing after a few iterations 
> despite not caching its results).
> Here is the script to reproduce the ArrayIndexOutOfBoundsException on the 
> current 2.0 branch:
> {code}
> import os
> import random
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> def randlong():
> return random.randint(-9223372036854775808, 9223372036854775807)
> while True:
> l1, l2 = randlong(), randlong()
> # Sample values that crash:
> # l1, l2 = 4661454128115150227, -5543241376386463808
> print "Testing with %s, %s" % (l1, l2)
> data1 = [(l1, ), (l2, )]
> data2 = [(l1, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> sqlc.registerDataFrameAsTable(df1, "df1")
> sqlc.registerDataFrameAsTable(df2, "df2")
> result_df = sqlc.sql("""
> SELECT
> df1.id1
> FROM df1
> LEFT OUTER JOIN df2 ON df1.id1 = df2.id2
> """)
> print result_df.collect()
> {code}
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1728150825
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
>   at 
> org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
>   at 
> 

[jira] [Resolved] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file

2016-08-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-16787.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14396
[https://github.com/apache/spark/pull/14396]

> SparkContext.addFile() should not fail if called twice with the same file
> -
>
> Key: SPARK-16787
> URL: https://issues.apache.org/jira/browse/SPARK-16787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.1, 2.1.0
>
>
> The behavior of SparkContext.addFile() changed slightly with the introduction 
> of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where 
> it was disabled by default) and became the default / only file server in 
> Spark 2.0.0.
> Prior to 2.0, calling SparkContext.addFile() twice with the same path would 
> succeed and would cause future tasks to receive an updated copy of the file. 
> This behavior was never explicitly documented but Spark has behaved this way 
> since very early 1.x versions (some of the relevant lines in 
> Executor.updateDependencies() have existed since 2012).
> In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call 
> will fail with a requirement error because NettyStreamManager tries to guard 
> against duplicate file registration.
> I believe that this change of behavior was unintentional and propose to 
> remove the {{require}} check so that Spark 2.0 matches 1.x's default behavior.
> This problem also affects addJar() in a more subtle way: the 
> fileServer.addJar() call will also fail with an exception but that exception 
> is logged and ignored due to some code which was added in 2014 in order to 
> ignore errors caused by missing Spark examples JARs when running in YARN 
> cluster mode (AFAIK).
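
For reference, a minimal sketch of the call pattern described above (not from the original report; the file path, app name and master are made up):

{code}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(
  new SparkConf().setAppName("addFile-twice").setMaster("local[2]"))

sc.addFile("/tmp/config.txt")   // first registration always succeeds
sc.addFile("/tmp/config.txt")   // pre-2.0: succeeds and refreshes the file;
                                // 2.0: fails the NettyStreamManager require() check

sc.parallelize(1 to 2).foreach { _ =>
  // tasks resolve their local copy through SparkFiles
  println(SparkFiles.get("config.txt"))
}
{code}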



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16838) Add PMML export for ML KMeans in PySpark

2016-08-02 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404591#comment-15404591
 ] 

Gayathri Murali commented on SPARK-16838:
-

I can work on this

> Add PMML export for ML KMeans in PySpark
> 
>
> Key: SPARK-16838
> URL: https://issues.apache.org/jira/browse/SPARK-16838
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>
> After we finish SPARK-11237 we should also expose PMML export in the Python 
> API for KMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16859) History Server storage information is missing

2016-08-02 Thread Andrey Ivanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Ivanov updated SPARK-16859:
--
Description: 
It looks like the job history Storage tab in the History Server has been broken 
for completed jobs since *1.6.2*. 
More specifically, it's been broken since 
[SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].

I've fixed it for my installation by effectively reverting the above patch 
([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).

IMHO, the most straightforward fix would be to implement 
_SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and to make 
sure it works from _ReplayListenerBus_.

The downside is that it will still work incorrectly with pre-patch job 
histories. But then, it hasn't worked since *1.6.2* anyhow.

PS: I'd really love to have this fixed eventually. But I'm pretty new to Apache 
Spark and lack hands-on Scala experience, so I'd prefer that it be fixed by 
someone experienced, with roadmap vision. If nobody volunteers I'll try to patch 
it myself.

  was:
It looks like job history storage tab in history server is broken for completed 
jobs since *1.6.2*. 
More specifically it's broken since 
[SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].

I've fixed for my installation by effectively reverting the above patch 
([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).

IMHO, the most straightforward fix would be to implement 
_SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and making 
sure it works from _ReplayListenerBus_.

The downside will be that it will still work incorrectly with pre patch job 
histories. But then, it doesn't work since *1.6.2* anyhow.


> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrey Ivanov
>  Labels: historyserver, newbie
>
> It looks like the job history Storage tab in the History Server has been 
> broken for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and to 
> make sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and lack hands-on Scala experience, so I'd prefer that it be 
> fixed by someone experienced, with roadmap vision. If nobody volunteers I'll 
> try to patch it myself.
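
A purely hypothetical sketch of what the proposed _JsonProtocol_ addition could look like (the method name and JSON field names are assumptions, not the actual patch; only the _BlockUpdatedInfo_ accessors come from the public API):

{code}
import org.json4s.JsonAST.JValue
import org.json4s.JsonDSL._
import org.apache.spark.scheduler.SparkListenerBlockUpdated

def blockUpdatedToJson(event: SparkListenerBlockUpdated): JValue = {
  val info = event.blockUpdatedInfo
  ("Event" -> "SparkListenerBlockUpdated") ~
  ("Block ID" -> info.blockId.toString) ~
  ("Storage Level" -> info.storageLevel.description) ~
  ("Memory Size" -> info.memSize) ~
  ("Disk Size" -> info.diskSize)
}
{code}

A matching fromJson counterpart would also be needed so ReplayListenerBus can rebuild the event when replaying logs.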



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2016-08-02 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-6399.
---
Resolution: Won't Fix

I think at this point it's pretty clear we won't do anything here.

> Code compiled against 1.3.0 may not run against older Spark versions
> 
>
> Key: SPARK-6399
> URL: https://issues.apache.org/jira/browse/SPARK-6399
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
> easier to use. The problem is that scalac now generates code that will not 
> run on older Spark versions if those conversions are used.
> Basically, even if you explicitly import {{SparkContext._}}, scalac will 
> generate references to the new methods in the {{RDD}} object instead. So the 
> compiled code will reference code that doesn't exist in older versions of 
> Spark.
> You can work around this by explicitly calling the methods in the 
> {{SparkContext}} object, although that's a little ugly.
> We should at least document this limitation (if there's no way to fix it), 
> since I believe forwards compatibility in the API was also a goal.
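
To make the workaround concrete, a small sketch (assumes an existing SparkContext named {{sc}}; the RDD contents are made up):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pre-1.3 style implicit conversions

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Compiled against 1.3+, this call is resolved through the new implicits on the
// RDD companion object, so the bytecode may not link against older Spark versions:
val counts1 = pairs.reduceByKey(_ + _)

// The explicit form mentioned above keeps the reference on the SparkContext object:
val counts2 = SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)
{code}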



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16859) History Server storage information is missing

2016-08-02 Thread Andrey Ivanov (JIRA)
Andrey Ivanov created SPARK-16859:
-

 Summary: History Server storage information is missing
 Key: SPARK-16859
 URL: https://issues.apache.org/jira/browse/SPARK-16859
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2
Reporter: Andrey Ivanov


It looks like the job history Storage tab in the History Server has been broken 
for completed jobs since *1.6.2*. 
More specifically, it's been broken since 
[SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].

I've fixed it for my installation by effectively reverting the above patch 
([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).

IMHO, the most straightforward fix would be to implement 
_SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and to make 
sure it works from _ReplayListenerBus_.

The downside is that it will still work incorrectly with pre-patch job 
histories. But then, it hasn't worked since *1.6.2* anyhow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16855) move Greatest and Least from conditionalExpressions.scala to arithmetic.scala

2016-08-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16855:

Fix Version/s: (was: 2.0.1)

> move Greatest and Least from conditionalExpressions.scala to arithmetic.scala
> -
>
> Key: SPARK-16855
> URL: https://issues.apache.org/jira/browse/SPARK-16855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16855) move Greatest and Least from conditionalExpressions.scala to arithmetic.scala

2016-08-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16855.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> move Greatest and Least from conditionalExpressions.scala to arithmetic.scala
> -
>
> Key: SPARK-16855
> URL: https://issues.apache.org/jira/browse/SPARK-16855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16796) Visible passwords on Spark environment page

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16796:
--
Assignee: Artur

> Visible passwords on Spark environment page
> ---
>
> Key: SPARK-16796
> URL: https://issues.apache.org/jira/browse/SPARK-16796
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Artur
>Assignee: Artur
>Priority: Trivial
> Attachments: 
> Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch
>
>
> Spark properties (passwords): 
> spark.ssl.keyPassword,spark.ssl.keyStorePassword,spark.ssl.trustStorePassword
> are visible in Web UI in environment page.
> Can we mask them from Web UI?
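
A hypothetical sketch of the kind of redaction being asked for (not the attached patch; the key names are the three properties listed above):

{code}
val sensitiveKeys = Set(
  "spark.ssl.keyPassword",
  "spark.ssl.keyStorePassword",
  "spark.ssl.trustStorePassword")

// Redact sensitive values before they are rendered on the environment page.
def redact(props: Seq[(String, String)]): Seq[(String, String)] =
  props.map { case (k, v) =>
    if (sensitiveKeys.contains(k)) (k, "*********(redacted)") else (k, v)
  }
{code}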



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16858) Removal of TestHiveSharedState

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16858:


Assignee: (was: Apache Spark)

> Removal of TestHiveSharedState
> --
>
> Key: SPARK-16858
> URL: https://issues.apache.org/jira/browse/SPARK-16858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Remove TestHiveSharedState. Otherwise, we are not really testing the 
> reflection logic based on the setting of CATALOG_IMPLEMENTATION.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16858) Removal of TestHiveSharedState

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404464#comment-15404464
 ] 

Apache Spark commented on SPARK-16858:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14463

> Removal of TestHiveSharedState
> --
>
> Key: SPARK-16858
> URL: https://issues.apache.org/jira/browse/SPARK-16858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Remove TestHiveSharedState. Otherwise, we are not really testing the 
> reflection logic based on the setting of CATALOG_IMPLEMENTATION.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16858) Removal of TestHiveSharedState

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16858:


Assignee: Apache Spark

> Removal of TestHiveSharedState
> --
>
> Key: SPARK-16858
> URL: https://issues.apache.org/jira/browse/SPARK-16858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Remove TestHiveSharedState. Otherwise, we are not really testing the 
> reflection logic based on the setting of CATALOG_IMPLEMENTATION.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16858) Removal of TestHiveSharedState

2016-08-02 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16858:
---

 Summary: Removal of TestHiveSharedState
 Key: SPARK-16858
 URL: https://issues.apache.org/jira/browse/SPARK-16858
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Remove TestHiveSharedState. Otherwise, we are not really testing the reflection 
logic based on the setting of CATALOG_IMPLEMENTATION.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-08-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15639:
---
Target Version/s: 2.0.1

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, we do not seem to actually set up the filters to be 
> pushed down.
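
For context, a minimal sketch of a user-level query that exercises this pushdown path (assumes a SparkSession named {{spark}}; the path and column name are made up):

{code}
import spark.implicits._

// Row-group pruning only helps when Parquet filter pushdown is enabled (it is by default).
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

val df = spark.read.parquet("/data/events")
df.filter($"id" > 1000000L).count()   // the filter should also prune whole RowGroups
{code}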



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-08-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15639:
---
Priority: Blocker  (was: Major)

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, we do not seem to actually set up the filters to be 
> pushed down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16816:
--
Assignee: sandeep purohit

> Add documentation to create JavaSparkContext from SparkSession
> --
>
> Key: SPARK-16816
> URL: https://issues.apache.org/jira/browse/SPARK-16816
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: sandeep purohit
>Assignee: sandeep purohit
>Priority: Trivial
> Fix For: 2.1.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> This issue documents how to create a JavaSparkContext from a SparkSession.
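
A minimal sketch of the pattern being documented (app name and master are made up; the Java form wraps {{spark.sparkContext}} the same way):

{code}
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .master("local[*]")
  .getOrCreate()

// Wrap the underlying SparkContext to obtain a JavaSparkContext.
val jsc = new JavaSparkContext(spark.sparkContext)
{code}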



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16816.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14436
[https://github.com/apache/spark/pull/14436]

> Add documentation to create JavaSparkContext from SparkSession
> --
>
> Key: SPARK-16816
> URL: https://issues.apache.org/jira/browse/SPARK-16816
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: sandeep purohit
>Priority: Trivial
> Fix For: 2.1.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> This issue documents how to create a JavaSparkContext from a SparkSession.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16850) Improve error message for greatest/least

2016-08-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16850:

Fix Version/s: 2.0.1

> Improve error message for greatest/least
> 
>
> Key: SPARK-16850
> URL: https://issues.apache.org/jira/browse/SPARK-16850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> The greatest/least functions do not have the most friendly error message for 
> data type mismatches:
> Error in SQL statement: AnalysisException: cannot resolve 'greatest(CAST(1.0 
> AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should 
> all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), 
> StringType)).; line 1 pos 7
> We should report the human-readable data types instead, rather than printing 
> "ArrayBuffer" and "StringType".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-02 Thread Ryan Claussen (JIRA)
Ryan Claussen created SPARK-16857:
-

 Summary: CrossValidator and KMeans throws IllegalArgumentException
 Key: SPARK-16857
 URL: https://issues.apache.org/jira/browse/SPARK-16857
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.1
 Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
Hadoop 2.4
Reporter: Ryan Claussen


I am attempting to use CrossValidator to train a KMeans model. When I attempt to 
fit the data, Spark throws an IllegalArgumentException (see below), since the 
KMeans algorithm outputs an Integer in the prediction column instead of a Double. 
Before I go too far: is using CrossValidator with KMeans supported?

Here's the exception:
{quote}
java.lang.IllegalArgumentException: requirement failed: Column prediction must 
be of type DoubleType but was actually IntegerType.
 at scala.Predef$.require(Predef.scala:233)
 at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
 at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
 at 
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
 at 
org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
 at 
com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
 at 
com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
 at 
com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
 at 
spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
 at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
{quote}

Here is the code I'm using to set up my cross validator. As the stack trace 
above indicates, it is failing at the fit step when {{cv.fit}} is called:
{quote}
...
val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
val labelConverter = new 
IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, 
mpc, labelConverter))

val evaluator = new 
MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")

val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 200, 
500)).build()
val cv = new 
CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
val cvModel = cv.fit(trainingData)
{quote}
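
Not from the report, but one hedged workaround sketch while this is investigated: cast the integer prediction column to double before the evaluator sees it, e.g. with a SQLTransformer stage (the new column name is made up):

{code}
import org.apache.spark.ml.feature.SQLTransformer

// Adds a double-typed copy of the prediction column.
val castPrediction = new SQLTransformer()
  .setStatement("SELECT *, CAST(prediction AS DOUBLE) AS predictionD FROM __THIS__")

// Add castPrediction to the Pipeline stages after `mpc`, and point the evaluator
// at the new column: evaluator.setPredictionCol("predictionD")
{code}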



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16836) Hive date/time function error

2016-08-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16836.
-
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.1.0
   2.0.1

> Hive date/time function error
> -
>
> Key: SPARK-16836
> URL: https://issues.apache.org/jira/browse/SPARK-16836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jesse Lord
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> Previously available hive functions for date/time are not available in Spark 
> 2.0 (e.g. current_date, current_timestamp). These functions work in Spark 
> 1.6.2 with HiveContext.
> Example (from spark-shell):
> {noformat}
> scala> spark.sql("select current_date")
> org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given 
> input columns: []; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided
> {noformat}
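
As a side note (an assumption based on the function registry, not part of the original report), the parenthesized call form still resolves in the affected version:

{code}
spark.sql("SELECT current_date(), current_timestamp()").show()
{code}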



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16062) PySpark SQL python-only UDTs don't work well

2016-08-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16062.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13778
[https://github.com/apache/spark/pull/13778]

> PySpark SQL python-only UDTs don't work well
> 
>
> Key: SPARK-16062
> URL: https://issues.apache.org/jira/browse/SPARK-16062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.1, 2.1.0
>
>
> Python-only UDTs don't work well. One example is:
> {code}
> import pyspark.sql.group
> from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
> from pyspark.sql.types import *
> schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
> df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for 
> i in range(10)], schema=schema)
> df.collect()  # this works
> df.show() # this doesn't work
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15989) PySpark SQL python-only UDTs don't support nested types

2016-08-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15989.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13778
[https://github.com/apache/spark/pull/13778]

> PySpark SQL python-only UDTs don't support nested types
> ---
>
> Key: SPARK-15989
> URL: https://issues.apache.org/jira/browse/SPARK-15989
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
> Fix For: 2.0.1, 2.1.0
>
>
> [This 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/611202526513296/1653464426712019/latest.html]
>  demonstrates the bug.
> The obvious issue is that nested UDTs are not supported if the UDT is 
> Python-only. Looking at the exception thrown, this seems to be because the 
> encoder on the Java end tries to encode the UDT as a Java class, which 
> doesn't exist for the [[PythonOnlyUDT]].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404369#comment-15404369
 ] 

Yin Huai commented on SPARK-16320:
--

Can you also try https://github.com/apache/spark/pull/13701 and see if it makes 
any difference?

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
> 
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
> 
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
> 
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the 
> same source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404369#comment-15404369
 ] 

Yin Huai edited comment on SPARK-16320 at 8/2/16 4:52 PM:
--

[~maver1ck]

Can you also try https://github.com/apache/spark/pull/13701 and see if it makes 
any difference?


was (Author: yhuai):
Can you also try https://github.com/apache/spark/pull/13701 and see if it makes 
any difference?

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
> 
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
> 
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
> 
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the 
> same source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404361#comment-15404361
 ] 

Michael Allman commented on SPARK-16320:


[~maver1ck] I'm having trouble reproducing your problem. I would like to help, 
but I need a dataset to work with. Can you please post a test dataset (to S3) 
and query in which you see the regression?

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
> 
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
> 
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
> 
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the 
> same source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16856) Link application summary page and detail page to the master page

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404331#comment-15404331
 ] 

Apache Spark commented on SPARK-16856:
--

User 'nblintao' has created a pull request for this issue:
https://github.com/apache/spark/pull/14461

> Link application summary page and detail page to the master page
> 
>
> Key: SPARK-16856
> URL: https://issues.apache.org/jira/browse/SPARK-16856
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16856) Link application summary page and detail page to the master page

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16856:


Assignee: (was: Apache Spark)

> Link application summary page and detail page to the master page
> 
>
> Key: SPARK-16856
> URL: https://issues.apache.org/jira/browse/SPARK-16856
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16856) Link application summary page and detail page to the master page

2016-08-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16856:


Assignee: Apache Spark

> Link application summary page and detail page to the master page
> 
>
> Key: SPARK-16856
> URL: https://issues.apache.org/jira/browse/SPARK-16856
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Tao Lin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16835) LinearRegression LogisticRegression AFTSuvivalRegression should unpersist input training data when exception throws

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16835.
---
Resolution: Won't Fix

> LinearRegression LogisticRegression AFTSuvivalRegression should unpersist 
> input training data when exception throws
> ---
>
> Key: SPARK-16835
> URL: https://issues.apache.org/jira/browse/SPARK-16835
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the `fit` interface of 
> LinearRegression, LogisticRegression and AFTSurvivalRegression 
> will persist the `dataframe` passed in if it is not already persisted,
> but if an exception is thrown, the dataframe is not unpersisted.
> As `fit` is a public interface, it is better to unpersist the parameter 
> object passed in in any case, if `fit` persists it internally.
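
A generic sketch of the pattern being proposed, not the actual {{fit}} code (the helper name is made up):

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Persist the input only if the caller has not, and unpersist even on failure.
def fitWithCleanup[T](dataset: Dataset[_])(train: Dataset[_] => T): T = {
  val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    train(dataset)
  } finally {
    if (handlePersistence) dataset.unpersist()
  }
}
{code}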



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16837:
--
Assignee: Tom Magrino

> TimeWindow incorrectly drops slideDuration in constructors
> --
>
> Key: SPARK-16837
> URL: https://issues.apache.org/jira/browse/SPARK-16837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tom Magrino
>Assignee: Tom Magrino
> Fix For: 2.0.1, 2.1.0
>
>
> Right now, the constructors for the TimeWindow expression in Catalyst 
> incorrectly use the windowDuration in place of the slideDuration. This will 
> cause incorrect windowing semantics after time window expressions are 
> analyzed by Catalyst.
> Relevant code is here: 
> https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16837.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14441
[https://github.com/apache/spark/pull/14441]

> TimeWindow incorrectly drops slideDuration in constructors
> --
>
> Key: SPARK-16837
> URL: https://issues.apache.org/jira/browse/SPARK-16837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tom Magrino
> Fix For: 2.0.1, 2.1.0
>
>
> Right now, the constructors for the TimeWindow expression in Catalyst 
> incorrectly use the windowDuration in place of the slideDuration. This will 
> cause incorrect windowing semantics after time window expressions are 
> analyzed by Catalyst.
> Relevant code is here: 
> https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16822) Support latex in scaladoc with MathJax

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16822:
--
Assignee: Shuai Lin

> Support latex in scaladoc with MathJax
> --
>
> Key: SPARK-16822
> URL: https://issues.apache.org/jira/browse/SPARK-16822
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Shuai Lin
>Assignee: Shuai Lin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, 
> but currently they render very poorly, e.g. [the doc of the LogisticGradient 
> class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].
> We can improve this by including the MathJax JavaScript in the scaladoc pages, 
> much like what we do for the markdown docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16822) Support latex in scaladoc with MathJax

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16822.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14438
[https://github.com/apache/spark/pull/14438]

> Support latex in scaladoc with MathJax
> --
>
> Key: SPARK-16822
> URL: https://issues.apache.org/jira/browse/SPARK-16822
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Shuai Lin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, 
> but currently they render very poorly, e.g. [the doc of the LogisticGradient 
> class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].
> We can improve this by including the MathJax JavaScript in the scaladoc pages, 
> much like what we do for the markdown docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16856) Link application summary page and detail page to the master page

2016-08-02 Thread Tao Lin (JIRA)
Tao Lin created SPARK-16856:
---

 Summary: Link application summary page and detail page to the 
master page
 Key: SPARK-16856
 URL: https://issues.apache.org/jira/browse/SPARK-16856
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Reporter: Tao Lin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16520) Link executors to corresponding worker pages

2016-08-02 Thread Tao Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Lin updated SPARK-16520:

Priority: Major  (was: Minor)

> Link executors to corresponding worker pages
> 
>
> Key: SPARK-16520
> URL: https://issues.apache.org/jira/browse/SPARK-16520
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Tao Lin
>
> On the Executors page, add links from executors to their corresponding 
> worker pages, i.e. make each executor's address or executor ID link to its 
> corresponding worker page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15541) SparkContext.stop throws error

2016-08-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15541.
---
Resolution: Resolved
  Assignee: Maciej Bryński

Resolved by https://github.com/apache/spark/pull/14459

> SparkContext.stop throws error
> --
>
> Key: SPARK-15541
> URL: https://issues.apache.org/jira/browse/SPARK-15541
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>Assignee: Maciej Bryński
>
> When running unit tests or examples from the command line or IntelliJ, 
> SparkContext throws errors.
> For example:
> ./bin/run-example ml.JavaNaiveBayesExample
> Exception in thread "main" 16/05/25 15:17:55 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> java.lang.NoSuchMethodError: 
> java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
>   at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:176)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:291)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:269)
>   at org.apache.spark.SparkEnv.stop(SparkEnv.scala:91)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1796)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1795)
>   at org.apache.spark.sql.SparkSession.stop(SparkSession.scala:577)
>   at 
> org.apache.spark.examples.ml.JavaNaiveBayesExample.main(JavaNaiveBayesExample.java:61)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 16/05/25 15:17:55 INFO ShutdownHookManager: Shutdown hook called



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16854) mapWithState Support for Python

2016-08-02 Thread Boaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404217#comment-15404217
 ] 

Boaz commented on SPARK-16854:
--

IMHO, streaming in Python would be incomplete without mapWithState. I'm finding 
it hard to use updateStateByKey for large states.
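
For reference, a sketch of the Scala mapWithState API whose Python counterpart is being requested here (the word-count stream is made up):

{code}
import org.apache.spark.streaming.{State, StateSpec}

val spec = StateSpec.function { (word: String, count: Option[Int], state: State[Int]) =>
  val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // only keys seen in the batch are touched, unlike updateStateByKey
  (word, sum)
}
// wordCounts.mapWithState(spec)   // wordCounts: DStream[(String, Int)]
{code}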

> mapWithState Support for Python
> ---
>
> Key: SPARK-16854
> URL: https://issues.apache.org/jira/browse/SPARK-16854
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Boaz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12650) No means to specify Xmx settings for spark-submit in cluster deploy mode for Spark on YARN

2016-08-02 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-12650:

Summary: No means to specify Xmx settings for spark-submit in cluster 
deploy mode for Spark on YARN  (was: No means to specify Xmx settings for 
SparkSubmit in cluster deploy mode for Spark on YARN)

> No means to specify Xmx settings for spark-submit in cluster deploy mode for 
> Spark on YARN
> --
>
> Key: SPARK-12650
> URL: https://issues.apache.org/jira/browse/SPARK-12650
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.2
> Environment: Hadoop 2.6.0
>Reporter: John Vines
>
> Background-
> I have an app master designed to do some work and then launch a spark job.
> Issue-
> If I use yarn-cluster, then SparkSubmit does not set -Xmx on itself at all, 
> so the JVM takes a default heap which is relatively large. This 
> causes a large amount of vmem to be used, so the container is killed by YARN. 
> This can be worked around by disabling YARN's vmem check, but that is a hack.
> If I run it in yarn-client mode, it's fine as long as my container has enough 
> space for the driver, which is manageable. But I feel that the utter lack of 
> an -Xmx setting for what I believe is a very small JVM is a problem.
> I believe this was introduced with the fix for SPARK-3884



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in cluster deploy mode for Spark on YARN

2016-08-02 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-12650:

Summary: No means to specify Xmx settings for SparkSubmit in cluster deploy 
mode for Spark on YARN  (was: No means to specify Xmx settings for SparkSubmit 
in yarn-cluster mode)

> No means to specify Xmx settings for SparkSubmit in cluster deploy mode for 
> Spark on YARN
> -
>
> Key: SPARK-12650
> URL: https://issues.apache.org/jira/browse/SPARK-12650
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.2
> Environment: Hadoop 2.6.0
>Reporter: John Vines
>
> Background-
> I have an app master designed to do some work and then launch a Spark job.
> Issue-
> If I use yarn-cluster, the SparkSubmit JVM is not given any -Xmx setting at 
> all, so it takes the default heap size, which is relatively large. This 
> causes a large amount of vmem to be consumed, and the container is killed by 
> YARN. This can be worked around by disabling YARN's vmem check, but that is a 
> hack.
> If I run it in yarn-client mode, it is fine as long as my container has 
> enough space for the driver, which is manageable. But the complete lack of an 
> -Xmx setting for what I believe is a very small JVM is a problem.
> I believe this was introduced with the fix for SPARK-3884



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable

2016-08-02 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404175#comment-15404175
 ] 

Jacek Laskowski commented on SPARK-14453:
-

Is anyone working on it? I just found a few places in Spark on YARN that should 
have been removed long ago (they only introduce noise in the code). I'd like to 
work on this if possible.
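
For reference, a hedged sketch (not from this issue) of the per-role 
configuration keys that replaced SPARK_JAVA_OPTS; the GC flags are only 
illustrative values.

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("no-spark-java-opts")
        # Executor JVM options are applied when the executors are launched.
        .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc"))

# Note: in client mode the driver JVM is already running, so
# spark.driver.extraJavaOptions is normally set in spark-defaults.conf or via
# spark-submit --driver-java-options rather than in application code.
sc = SparkContext(conf=conf)
{code}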

> Remove SPARK_JAVA_OPTS environment variable
> ---
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0. With the release of a major 
> version (2.0), I think it would be better to remove support for this env 
> variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable

2016-08-02 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-14453:

Issue Type: Task  (was: Bug)

> Remove SPARK_JAVA_OPTS environment variable
> ---
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0. With the release of a major 
> version (2.0), I think it would be better to remove support for this env 
> variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable

2016-08-02 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-14453:

Summary: Remove SPARK_JAVA_OPTS environment variable  (was: Consider 
removing SPARK_JAVA_OPTS env variable)

> Remove SPARK_JAVA_OPTS environment variable
> ---
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0. With the release of a major 
> version (2.0), I think it would be better to remove support for this env 
> variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16852) RejectedExecutionException when exit at some times

2016-08-02 Thread Weizhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404166#comment-15404166
 ] 

Weizhong commented on SPARK-16852:
--

I ran the 2T TPC-DS queries with the script below, and sometimes the stack 
trace is printed on exit.
{noformat}
cases="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99"
i=1
total=99

for t in ${cases}
do
  echo "run sql99 sample query $t ($i/$total)."
  spark-sql --master yarn-client --driver-memory 30g --executor-memory 30g \
    --num-executors 6 --executor-cores 5 -f query/query${t}.sql \
    > logs/${t}.result 2> logs/${t}.log
  i=`expr $i + 1`
done
{noformat}

> RejectedExecutionException when exit at some times
> --
>
> Key: SPARK-16852
> URL: https://issues.apache.org/jira/browse/SPARK-16852
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Weizhong
>Priority: Minor
>
> If we run a huge job, a RejectedExecutionException is sometimes printed on 
> exit:
> {noformat}
> 16/05/27 08:30:40 ERROR client.TransportResponseHandler: Still have 3 
> requests outstanding when connection from HGH117808/10.184.66.104:41980 
> is closed
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@6b66dba rejected from 
> java.util.concurrent.ThreadPoolExecutor@60725736[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 269]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.tryFailure(Promise.scala:115)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$1.apply(NettyRpcEnv.scala:214)
>   at 
> 

[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404114#comment-15404114
 ] 

Maciej Bryński edited comment on SPARK-16320 at 8/2/16 3:05 PM:


[~clockfly]
I tested your patch. The results are the same as for Spark 2.0 without the patch.

But I have one more observation.
For every stage there is an "Input" column in the Spark UI.
For Spark 2.0 it shows up to 11 GB; for Spark 1.6, only 2.8 GB.
Maybe this is connected?

I attached two screenshots from the UI.

PS. I also tested the spark.sql.codegen.maxFields parameter to force Spark to 
use whole-stage code generation, but it makes almost no difference (see the 
sketch below).


was (Author: maver1ck):
[~clockfly]
I tested your patch. Results are equal to Spark 2.0 without the patch.

But I have one more observation.
For every Stage there is "Input" column in Spark-UI.
For Spark 2.0 it shows up to 11 GB.
For Spark 1.6 only 2.8 GB.
Maybe this is connected ?

I attached two screenshots from UI.
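
For reference, a minimal sketch of how the spark.sql.codegen.maxFields setting 
mentioned in the comment can be applied; it assumes Spark 2.0, the Parquet 
path and MAX_SIZE from the script quoted below, and an arbitrary "large 
enough" value of 10000.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("maxFieldsTest").getOrCreate()

# Raise the field-count limit so whole-stage codegen is not skipped for the
# very wide, deeply nested schema used in this benchmark.
spark.conf.set("spark.sql.codegen.maxFields", 10000)

MAX_SIZE = 2**32 - 1
df = spark.read.parquet("/mnt/mfs/parquet_nested")
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}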

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a Parquet file with many nested columns (about 30 GB in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> For this query the performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate the data and confirm the problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
>
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
>
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
>
> def create_sample_data(levels, rows, path):
>
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE)
>                 for i in range(cols)}
>
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
>
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i))
>                               for i in range(levels[0])])
>
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
>
> #Sample data
> create_sample_data([2,10,200], 100, path)
>
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> The analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to 
> the same source.
> I attached some VisualVM profiles there.
> The most interesting ones are from the queries:
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


