[jira] [Resolved] (SPARK-6548) stddev_pop and stddev_samp aggregate functions

2015-09-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-6548.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 6297
[https://github.com/apache/spark/pull/6297]

> stddev_pop and stddev_samp aggregate functions
> --
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
> Fix For: 1.6.0
>
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
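
For reference, a minimal Scala sketch of the "compute it from existing aggregates" option mentioned above, deriving the population standard deviation from avg alone; the column name used below is an assumption for illustration only:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, sqrt}

// stddev_pop(x) = sqrt(E[x^2] - E[x]^2), expressed with existing aggregates only.
def stddevPop(df: DataFrame, colName: String): DataFrame = {
  val c = col(colName)
  df.agg(sqrt(avg(c * c) - avg(c) * avg(c)))
}
{code}

A dedicated Stddev Catalyst expression, the other option in the ticket, can also avoid the numerical-stability issues of this single-pass formula.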






[jira] [Created] (SPARK-10578) pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column

2015-09-12 Thread Karen Yin-Yee Ng (JIRA)
Karen Yin-Yee Ng created SPARK-10578:


 Summary: pyspark.ml.classification.RandomForestClassifer does not 
return `rawPrediction` column
 Key: SPARK-10578
 URL: https://issues.apache.org/jira/browse/SPARK-10578
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.4.1, 1.4.0
 Environment: CentOS, PySpark 1.4.1, Scala 2.10 
Reporter: Karen Yin-Yee Ng


To use `pyspark.ml.classification.RandomForestClassifier` with 
`BinaryClassificationEvaluator`, a column called `rawPrediction` needs to be 
returned by the `RandomForestClassifier`. 
The PySpark documentation example for `LogisticRegression` outputs the 
`rawPrediction` column, but `RandomForestClassifier` does not.

Therefore, one is unable to use `RandomForestClassifier` with the evaluator or 
put it in a pipeline with cross-validation.

A relevant piece of code showing how to reproduce the bug can be found at:
https://gist.github.com/karenyyng/cf61ae655b032f754bfb

A relevant post due to this possible bug can also be found at:
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html








[jira] [Commented] (SPARK-10401) spark-submit --unsupervise

2015-09-12 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742252#comment-14742252
 ] 

Sanket Reddy commented on SPARK-10401:
--

I would like to work on it

> spark-submit --unsupervise 
> ---
>
> Key: SPARK-10401
> URL: https://issues.apache.org/jira/browse/SPARK-10401
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos
>Affects Versions: 1.5.0
>Reporter: Alberto Miorin
>
> When I submit a streaming job with the option --supervise to the new Mesos 
> Spark dispatcher, I cannot decommission the job.
> I tried spark-submit --kill, but the dispatcher always restarts it.
> The driver and executors are both Docker containers.
> I think there should be a subcommand: spark-submit --unsupervise 






[jira] [Updated] (SPARK-10429) MutableProjection should evaluate all expressions first and then update the mutable row

2015-09-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10429:
-
Assignee: Wenchen Fan

> MutableProjection should evaluate all expressions first and then update the 
> mutable row
> ---
>
> Key: SPARK-10429
> URL: https://issues.apache.org/jira/browse/SPARK-10429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Wenchen Fan
>Priority: Blocker
>
> Right now, SQL's mutable projection updates each slot of the mutable 
> row as soon as it evaluates the corresponding expression. This makes the 
> behavior of MutableProjection confusing and complicates the implementation of 
> common aggregate functions like stddev, because developers need to be aware 
> that when evaluating the {{i+1}}th expression of a mutable projection, the {{i}}th 
> slot of the mutable row has already been updated.
> A better behavior for MutableProjection would be to evaluate all 
> expressions first and then update all values of the mutable row.
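
As an illustration of the intended change, a small self-contained Scala sketch (not Spark's actual MutableProjection code) contrasting the two evaluation orders:

{code}
// "exprs" stand in for compiled expressions; in the real code they may read the
// target row, which is why the evaluation order matters.
def projectCurrent(target: Array[Any], exprs: Seq[() => Any]): Unit =
  // current behavior: slot i is overwritten before expression i+1 is evaluated
  exprs.zipWithIndex.foreach { case (e, i) => target(i) = e() }

def projectProposed(target: Array[Any], exprs: Seq[() => Any]): Unit = {
  // proposed behavior: evaluate every expression first, then update the row
  val results = exprs.map(e => e())
  results.zipWithIndex.foreach { case (r, i) => target(i) = r }
}
{code}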






[jira] [Comment Edited] (SPARK-10429) MutableProjection should evaluate all expressions first and then update the mutable row

2015-09-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742251#comment-14742251
 ] 

Yin Huai edited comment on SPARK-10429 at 9/12/15 10:50 PM:


[Here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala#L41-L66]
 and [here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L79-L87]
 are the places that need to be changed.

The main work is to handle cases when [we need to split generated 
code|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala#L66].
 When we need to split generated code, we need to [create mutable 
states|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L80-L82]
 for the evaluated expression results. 


was (Author: yhuai):
[Here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala#L41-L66]
 and [here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L79-L87]
 are the places that need to be changed.

> MutableProjection should evaluate all expressions first and then update the 
> mutable row
> ---
>
> Key: SPARK-10429
> URL: https://issues.apache.org/jira/browse/SPARK-10429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Right now, SQL's mutable projection updates each slot of the mutable 
> row as soon as it evaluates the corresponding expression. This makes the 
> behavior of MutableProjection confusing and complicates the implementation of 
> common aggregate functions like stddev, because developers need to be aware 
> that when evaluating the {{i+1}}th expression of a mutable projection, the {{i}}th 
> slot of the mutable row has already been updated.
> A better behavior for MutableProjection would be to evaluate all 
> expressions first and then update all values of the mutable row.






[jira] [Comment Edited] (SPARK-10429) MutableProjection should evaluate all expressions first and then update the mutable row

2015-09-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742251#comment-14742251
 ] 

Yin Huai edited comment on SPARK-10429 at 9/12/15 10:37 PM:


[Here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala#L41-L66]
 and [here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L79-L87]
 are the places that need to be changed.


was (Author: yhuai):
[Here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala#L41-L66]
 is the code block that needs to be changed.

> MutableProjection should evaluate all expressions first and then update the 
> mutable row
> ---
>
> Key: SPARK-10429
> URL: https://issues.apache.org/jira/browse/SPARK-10429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Right now, SQL's mutable projection updates each slot of the mutable 
> row as soon as it evaluates the corresponding expression. This makes the 
> behavior of MutableProjection confusing and complicates the implementation of 
> common aggregate functions like stddev, because developers need to be aware 
> that when evaluating the {{i+1}}th expression of a mutable projection, the {{i}}th 
> slot of the mutable row has already been updated.
> A better behavior for MutableProjection would be to evaluate all 
> expressions first and then update all values of the mutable row.






[jira] [Issue Comment Deleted] (SPARK-10401) spark-submit --unsupervise

2015-09-12 Thread Sanket Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-10401:
-
Comment: was deleted

(was: I would like to work on it)

> spark-submit --unsupervise 
> ---
>
> Key: SPARK-10401
> URL: https://issues.apache.org/jira/browse/SPARK-10401
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos
>Affects Versions: 1.5.0
>Reporter: Alberto Miorin
>
> When I submit a streaming job with the option --supervise to the new Mesos 
> Spark dispatcher, I cannot decommission the job.
> I tried spark-submit --kill, but the dispatcher always restarts it.
> The driver and executors are both Docker containers.
> I think there should be a subcommand: spark-submit --unsupervise 






[jira] [Resolved] (SPARK-10330) Use SparkHadoopUtil TaskAttemptContext reflection methods in more places

2015-09-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10330.

   Resolution: Fixed
 Assignee: Josh Rosen
Fix Version/s: 1.6.0

Fixed by my PR for 1.6.0.

> Use SparkHadoopUtil TaskAttemptContext reflection methods in more places
> 
>
> Key: SPARK-10330
> URL: https://issues.apache.org/jira/browse/SPARK-10330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> SparkHadoopUtil contains methods that use reflection to work around 
> TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We 
> should use these methods in more places.






[jira] [Commented] (SPARK-10429) MutableProjection should evaluate all expressions first and then update the mutable row

2015-09-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742251#comment-14742251
 ] 

Yin Huai commented on SPARK-10429:
--

[Here | 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala#L41-L66]
 is the code block that needs to be changed.

> MutableProjection should evaluate all expressions first and then update the 
> mutable row
> ---
>
> Key: SPARK-10429
> URL: https://issues.apache.org/jira/browse/SPARK-10429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Right now, SQL's mutable projection updates each slot of the mutable 
> row as soon as it evaluates the corresponding expression. This makes the 
> behavior of MutableProjection confusing and complicates the implementation of 
> common aggregate functions like stddev, because developers need to be aware 
> that when evaluating the {{i+1}}th expression of a mutable projection, the {{i}}th 
> slot of the mutable row has already been updated.
> A better behavior for MutableProjection would be to evaluate all 
> expressions first and then update all values of the mutable row.






[jira] [Commented] (SPARK-6548) stddev_pop and stddev_samp aggregate functions

2015-09-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742203#comment-14742203
 ] 

Reynold Xin commented on SPARK-6548:


[~davies], the patch that was merged is not using the new aggregate interface, 
but the old one that is to be removed in 1.6?


> stddev_pop and stddev_samp aggregate functions
> --
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
> Fix For: 1.6.0
>
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776






[jira] [Commented] (SPARK-10557) Publish Spark 1.5.0 on Maven central

2015-09-12 Thread Marko Asplund (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742209#comment-14742209
 ] 

Marko Asplund commented on SPARK-10557:
---

thanks! (y)

> Publish Spark 1.5.0 on Maven central
> 
>
> Key: SPARK-10557
> URL: https://issues.apache.org/jira/browse/SPARK-10557
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Marko Asplund
>
> Spark v1.5.0 has been officially released, but it has not been published on Maven Central.
> https://spark.apache.org/releases/spark-release-1-5-0.html
> Also, in JIRA, 1.5.0 is listed as an "unreleased" version.






[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-09-12 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742161#comment-14742161
 ] 

Bertrand Dechoux commented on SPARK-9720:
-

The pull request can be merged.

> spark.ml Identifiable types should have UID in toString methods
> ---
>
> Key: SPARK-9720
> URL: https://issues.apache.org/jira/browse/SPARK-9720
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Bertrand Dechoux
>Priority: Minor
>  Labels: starter
>
> It would be nice to include the UID (instance name) in toString methods.  
> That's the default behavior for Identifiable, but some types override the 
> default toString and do not include the UID.
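
For illustration, a minimal Scala sketch of the kind of override being asked for (the class and its fields are made up, not a specific Spark type):

{code}
// Identifiable's default toString is just the uid; custom overrides should keep it.
class MyModel(val uid: String, val numFeatures: Int) {
  override def toString: String = s"MyModel: uid=$uid, numFeatures=$numFeatures"
}
{code}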






[jira] [Commented] (SPARK-10576) Move .java files out of src/main/scala

2015-09-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742280#comment-14742280
 ] 

Patrick Wendell commented on SPARK-10576:
-

FWIW, it seems to me like moving them into /java makes sense. If we are going to 
have src/main/scala and src/main/java, we might as well use them correctly. What 
do you think, [~rxin]?

> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.






[jira] [Created] (SPARK-10579) Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in Statistics , e.g. for columns

2015-09-12 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-10579:
-

 Summary: Extend statistical functions: Add 
Cardinality/Quantiles/Quartiles/Median in Statistics , e.g. for columns
 Key: SPARK-10579
 URL: https://issues.apache.org/jira/browse/SPARK-10579
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Narine Kokhlikyan
Priority: Minor
 Fix For: 1.6.0


Hi everyone,

I think it would be good to extend the statistical functions in the mllib package by 
adding cardinality/quantiles/quartiles/median for columns, as many other 
ML and statistical libraries already provide them. I couldn't find them in the mllib 
package, hence I would like to suggest them.

Since this is my first time working with JIRA, I'd truly appreciate it if someone 
could review this and let me know what you think. 

Also, I'd really like to work on it, and I'm looking forward to hearing from you!

Thanks,
Narine
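
For context, a minimal Scala sketch of one such helper, computing exact sort-based quantiles for a numeric column held as an RDD[Double]; the helper name and approach are assumptions for illustration, not existing MLlib API:

{code}
import org.apache.spark.rdd.RDD

def columnQuantiles(column: RDD[Double], probs: Seq[Double]): Seq[Double] = {
  // index the sorted values so that a quantile becomes a lookup by position
  val indexed = column.sortBy(identity).zipWithIndex().map(_.swap).cache()
  val n = indexed.count()
  probs.map { p =>
    val i = math.round(p * (n - 1))
    indexed.lookup(i).head
  }
}

// e.g. the median: columnQuantiles(rdd, Seq(0.5)).head
{code}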






[jira] [Created] (SPARK-10575) Wrap RDD.takeSample with scope

2015-09-12 Thread Vinod KC (JIRA)
Vinod KC created SPARK-10575:


 Summary: Wrap RDD.takeSample with scope
 Key: SPARK-10575
 URL: https://issues.apache.org/jira/browse/SPARK-10575
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Vinod KC
Priority: Minor


Remove the return statements in RDD.takeSample and wrap it with withScope
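
For context, a minimal Scala sketch of the withScope wrapping pattern; the helper below is a stand-in for RDD's private withScope, and the body is illustrative rather than the real takeSample implementation:

{code}
// withScope-style wrapping passes the whole method body as a single expression
// to a scoping helper, which is why early "return" statements have to go.
def withScope[T](body: => T): T = body  // stand-in for RDD's private helper

def takeSampleSketch(data: Seq[Int], num: Int): Seq[Int] = withScope {
  // express the early-exit cases as one expression instead of returns
  if (num <= 0 || data.isEmpty) Seq.empty else data.take(num)
}
{code}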






[jira] [Assigned] (SPARK-10575) Wrap RDD.takeSample with scope

2015-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10575:


Assignee: (was: Apache Spark)

> Wrap RDD.takeSample with scope
> --
>
> Key: SPARK-10575
> URL: https://issues.apache.org/jira/browse/SPARK-10575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Vinod KC
>Priority: Minor
>
> Remove the return statements in RDD.takeSample and wrap it with withScope






[jira] [Commented] (SPARK-10575) Wrap RDD.takeSample with scope

2015-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741961#comment-14741961
 ] 

Apache Spark commented on SPARK-10575:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/8730

> Wrap RDD.takeSample with scope
> --
>
> Key: SPARK-10575
> URL: https://issues.apache.org/jira/browse/SPARK-10575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Vinod KC
>Priority: Minor
>
> Remove the return statements in RDD.takeSample and wrap it with withScope






[jira] [Assigned] (SPARK-10575) Wrap RDD.takeSample with scope

2015-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10575:


Assignee: Apache Spark

> Wrap RDD.takeSample with scope
> --
>
> Key: SPARK-10575
> URL: https://issues.apache.org/jira/browse/SPARK-10575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Vinod KC
>Assignee: Apache Spark
>Priority: Minor
>
> Remove the return statements in RDD.takeSample and wrap it with withScope






[jira] [Updated] (SPARK-10566) SnappyCompressionCodec init exception handling masks important error information

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10566:
--
Assignee: Daniel Imfeld

> SnappyCompressionCodec init exception handling masks important error 
> information
> 
>
> Key: SPARK-10566
> URL: https://issues.apache.org/jira/browse/SPARK-10566
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.0
>Reporter: Daniel Imfeld
>Assignee: Daniel Imfeld
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> The change to always throw an IllegalArgumentException when failing to load 
> in SnappyCompressionCodec (CompressionCodec.scala:151) throws away the 
> description from the exception thrown, which makes it really difficult to 
> actually figure out what the problem is:
> : java.lang.IllegalArgumentException
>   at 
> org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:151)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> Removing this try...catch, I get the following error, which actually gives 
> some information about how to fix the problem:
> : java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at 
> org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:178)
>   at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:152)
> A change to initialize the IllegalArgumentException with the value of 
> e.getMessage() would be great, as the current error without any description 
> just leads to a lot of frustrating guesswork.
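
A minimal Scala sketch of the proposed change: preserve the underlying message (and cause) when rethrowing. The surrounding code is paraphrased, not the exact CompressionCodec.scala source, and assumes snappy-java is on the classpath:

{code}
def checkSnappyAvailable(): Unit =
  try {
    // touching Snappy here makes a native-library load failure surface immediately
    org.xerial.snappy.Snappy.getNativeLibraryVersion
  } catch {
    case e: Throwable =>
      // keep the original description instead of a bare IllegalArgumentException
      throw new IllegalArgumentException("Snappy is not available: " + e.getMessage, e)
  }
{code}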






[jira] [Commented] (SPARK-10518) Update code examples in spark.ml user guide to use LIBSVM data source instead of MLUtils

2015-09-12 Thread shimizu yoshihiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741922#comment-14741922
 ] 

shimizu yoshihiro commented on SPARK-10518:
---

[~mengxr] Thank you for the review on GitHub. Here is my account name. Thanks!

> Update code examples in spark.ml user guide to use LIBSVM data source instead 
> of MLUtils
> 
>
> Key: SPARK-10518
> URL: https://issues.apache.org/jira/browse/SPARK-10518
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>
> SPARK-10117 was merged; we should use the LIBSVM data source in the example code 
> in the spark.ml user guide, e.g.,
> {code}
> val df = sqlContext.read.format("libsvm").load("path")
> {code}
> instead of
> {code}
> val df = MLUtils.loadLibSVMFile(sc, "path").toDF()
> {code}
> We should update the following:
> {code}
> ml-ensembles.md:40:val data = MLUtils.loadLibSVMFile(sc,
> ml-ensembles.md:87:RDD data = MLUtils.loadLibSVMFile(jsc.sc(),
> ml-features.md:866:val data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> ml-features.md:892:JavaRDD rdd = MLUtils.loadLibSVMFile(sc.sc(),
> ml-features.md:917:data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> ml-features.md:940:val data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-features.md:964:  MLUtils.loadLibSVMFile(jsc.sc(), 
> "data/mllib/sample_libsvm_data.txt").toJavaRDD();
> ml-features.md:985:data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-features.md:1022:val data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-features.md:1047:  MLUtils.loadLibSVMFile(jsc.sc(), 
> "data/mllib/sample_libsvm_data.txt").toJavaRDD();
> ml-features.md:1068:data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-linear-methods.md:44:val training = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> ml-linear-methods.md:84:DataFrame training = 
> sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), 
> LabeledPoint.class);
> ml-linear-methods.md:110:training = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> {code}






[jira] [Resolved] (SPARK-10566) SnappyCompressionCodec init exception handling masks important error information

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10566.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8725
[https://github.com/apache/spark/pull/8725]

> SnappyCompressionCodec init exception handling masks important error 
> information
> 
>
> Key: SPARK-10566
> URL: https://issues.apache.org/jira/browse/SPARK-10566
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.0
>Reporter: Daniel Imfeld
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> The change to always throw an IllegalArgumentException when failing to load 
> in SnappyCompressionCodec (CompressionCodec.scala:151) throws away the 
> description from the exception thrown, which makes it really difficult to 
> actually figure out what the problem is:
> : java.lang.IllegalArgumentException
>   at 
> org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:151)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> Removing this try...catch, I get the following error, which actually gives 
> some information about how to fix the problem:
> : java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>   at java.lang.System.loadLibrary(System.java:1122)
>   at 
> org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:178)
>   at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:152)
> A change to initialize the IllegalArgumentException with the value of 
> e.getMessage() would be great, as the current error without any description 
> just leads to a lot of frustrating guesswork.






[jira] [Updated] (SPARK-10554) Potential NPE with ShutdownHook

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10554:
--
Assignee: Nithin Asokan

> Potential NPE with ShutdownHook
> ---
>
> Key: SPARK-10554
> URL: https://issues.apache.org/jira/browse/SPARK-10554
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.5.0
>Reporter: Nithin Asokan
>Assignee: Nithin Asokan
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> Originally posted in user mailing list 
> [here|http://apache-spark-user-list.1001560.n3.nabble.com/Potential-NPE-while-exiting-spark-shell-tt24523.html]
> I'm currently using Spark 1.3.0 on a YARN cluster deployed through CDH 5.4. My 
> cluster does not have a 'default' queue, and launching 'spark-shell' submits 
> a YARN application that gets killed immediately because the queue does not 
> exist. However, the spark-shell session is still in progress after throwing a 
> bunch of errors while creating the SQL context. Upon submitting an 'exit' 
> command, there appears to be an NPE from DiskBlockManager with the following 
> stack trace:
> {code}
> ERROR Utils: Uncaught exception in thread delete Spark local dirs 
> java.lang.NullPointerException 
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:161)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:141)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) 
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
>  
> Exception in thread "delete Spark local dirs" java.lang.NullPointerException 
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:161)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:141)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) 
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
>  
> {code}
> I believe the problem is surfacing from a shutdown hook that 
> tries to clean up local directories. In this specific case, because the YARN 
> application was not submitted successfully, the block manager was not 
> registered; as a result, it does not have a valid blockManagerId, as seen here: 
> https://github.com/apache/spark/blob/v1.3.0/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L161
> Has anyone faced this issue before? Could this be a problem with the way the 
> shutdown hook behaves currently? 
> Note: I referenced the source from the Apache Spark repo rather than Cloudera's.
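
A small, self-contained Scala sketch of the defensive pattern this points at (purely illustrative; not the DiskBlockManager code and not necessarily the merged fix): skip cleanup when registration never completed instead of dereferencing a null id.

{code}
import java.io.File

class ShutdownCleanupSketch(registeredId: Option[String], localDirs: Seq[File]) {
  def doStop(): Unit = registeredId match {
    case None    => // never registered: nothing to clean up, so no NPE
    case Some(_) => localDirs.filter(_.exists()).foreach(_.delete())
  }
}
{code}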






[jira] [Resolved] (SPARK-10554) Potential NPE with ShutdownHook

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10554.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8720
[https://github.com/apache/spark/pull/8720]

> Potential NPE with ShutdownHook
> ---
>
> Key: SPARK-10554
> URL: https://issues.apache.org/jira/browse/SPARK-10554
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.5.0
>Reporter: Nithin Asokan
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> Originally posted in user mailing list 
> [here|http://apache-spark-user-list.1001560.n3.nabble.com/Potential-NPE-while-exiting-spark-shell-tt24523.html]
> I'm currently using Spark 1.3.0 on a YARN cluster deployed through CDH 5.4. My 
> cluster does not have a 'default' queue, and launching 'spark-shell' submits 
> a YARN application that gets killed immediately because the queue does not 
> exist. However, the spark-shell session is still in progress after throwing a 
> bunch of errors while creating the SQL context. Upon submitting an 'exit' 
> command, there appears to be an NPE from DiskBlockManager with the following 
> stack trace:
> {code}
> ERROR Utils: Uncaught exception in thread delete Spark local dirs 
> java.lang.NullPointerException 
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:161)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:141)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) 
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
>  
> Exception in thread "delete Spark local dirs" java.lang.NullPointerException 
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:161)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:141)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:139)
>  
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) 
> at 
> org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)
>  
> {code}
> I believe the problem is surfacing from a shutdown hook that 
> tries to clean up local directories. In this specific case, because the YARN 
> application was not submitted successfully, the block manager was not 
> registered; as a result, it does not have a valid blockManagerId, as seen here: 
> https://github.com/apache/spark/blob/v1.3.0/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L161
> Has anyone faced this issue before? Could this be a problem with the way the 
> shutdown hook behaves currently? 
> Note: I referenced the source from the Apache Spark repo rather than Cloudera's.






[jira] [Commented] (SPARK-10568) Error thrown in stopping one component in SparkContext.stop() doesn't allow other components to be stopped

2015-09-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741978#comment-14741978
 ] 

Sean Owen commented on SPARK-10568:
---

Yeah, I can imagine some relatively painless Scala code that iterates over a 
bunch of closures that shut things down, logs exceptions, and continues in the 
face of failures. Ideally we'd also check that shutdown happens in the reverse 
order of initialization.

OK, if the YARN issue is really outside Spark, then yes, that's out of scope here.
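
Something like the following self-contained sketch (names are illustrative, not Spark's actual internals):

{code}
// Run every shutdown step even if earlier ones fail, logging instead of aborting,
// and stop components in the reverse order of their initialization.
object GracefulStopSketch {
  private def tryLogNonFatal(name: String)(block: => Unit): Unit =
    try block catch {
      case e: Exception => System.err.println(s"Error while stopping $name: $e")
    }

  def stopAll(stepsInInitOrder: Seq[(String, () => Unit)]): Unit =
    stepsInInitOrder.reverse.foreach { case (name, stop) =>
      tryLogNonFatal(name)(stop())
    }
}
{code}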

> Error thrown in stopping one component in SparkContext.stop() doesn't allow 
> other components to be stopped
> --
>
> Key: SPARK-10568
> URL: https://issues.apache.org/jira/browse/SPARK-10568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Matt Cheah
>
> When I shut down a Java process that is running a SparkContext, it invokes a 
> shutdown hook that eventually calls SparkContext.stop(), and inside 
> SparkContext.stop() each individual component (DiskBlockManager, Scheduler 
> Backend) is stopped. If an exception is thrown in stopping one of these 
> components, none of the other components will be stopped cleanly either. This 
> caused problems when I stopped a Java process running a Spark context in 
> yarn-client mode, because not properly stopping YarnSchedulerBackend leads to 
> problems.
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a Spark Context
> 3. Kill the Java process with the Spark Context in #2
> 4. The job remains in the YARN UI as ACCEPTED
> Looking in the logs we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception 
> in thread Thread-3
> java.lang.NullPointerException: null
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at 
> org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) 
> ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
> I think what's going on is that when we kill the application in the queued 
> state, it tries to run the SparkContext.stop() method on the driver and stop 
> each component. It dies trying to stop the DiskBlockManager because it hasn't 
> been initialized yet - the application is still waiting to be scheduled by 
> the Yarn RM - but YarnClient.stop() is not invoked as a result, leaving the 
> application sticking around in the accepted state.
> Because of what appear to be bugs in the YARN scheduler, entering this state 
> makes it so that the YARN scheduler is unable to schedule any more jobs 
> unless we manually remove this application via the YARN CLI. We can tackle 
> the YARN stuck state separately, but ensuring that all components get at 
> least some chance to stop when a SparkContext stops seems like a good idea. 
> Of course we can still throw some exception and/or log exceptions for 
> everything that goes wrong at the end of stopping the context.






[jira] [Updated] (SPARK-10568) Error thrown in stopping one component in SparkContext.stop() doesn't allow other components to be stopped

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10568:
--
Priority: Minor  (was: Major)

> Error thrown in stopping one component in SparkContext.stop() doesn't allow 
> other components to be stopped
> --
>
> Key: SPARK-10568
> URL: https://issues.apache.org/jira/browse/SPARK-10568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Matt Cheah
>Priority: Minor
>
> When I shut down a Java process that is running a SparkContext, it invokes a 
> shutdown hook that eventually calls SparkContext.stop(), and inside 
> SparkContext.stop() each individual component (DiskBlockManager, Scheduler 
> Backend) is stopped. If an exception is thrown in stopping one of these 
> components, none of the other components will be stopped cleanly either. This 
> caused problems when I stopped a Java process running a Spark context in 
> yarn-client mode, because not properly stopping YarnSchedulerBackend leads to 
> problems.
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a Spark Context
> 3. Kill the Java process with the Spark Context in #2
> 4. The job remains in the YARN UI as ACCEPTED
> Looking in the logs we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception 
> in thread Thread-3
> java.lang.NullPointerException: null
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at 
> org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) 
> ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
> I think what's going on is that when we kill the application in the queued 
> state, it tries to run the SparkContext.stop() method on the driver and stop 
> each component. It dies trying to stop the DiskBlockManager because it hasn't 
> been initialized yet - the application is still waiting to be scheduled by 
> the Yarn RM - but YarnClient.stop() is not invoked as a result, leaving the 
> application sticking around in the accepted state.
> Because of what appear to be bugs in the YARN scheduler, entering this state 
> makes it so that the YARN scheduler is unable to schedule any more jobs 
> unless we manually remove this application via the YARN CLI. We can tackle 
> the YARN stuck state separately, but ensuring that all components get at 
> least some chance to stop when a SparkContext stops seems like a good idea. 
> Of course we can still throw some exception and/or log exceptions for 
> everything that goes wrong at the end of stopping the context.






[jira] [Resolved] (SPARK-10547) Streamline / improve style of Java API tests

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10547.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8706
[https://github.com/apache/spark/pull/8706]

> Streamline / improve style of Java API tests
> 
>
> Key: SPARK-10547
> URL: https://issues.apache.org/jira/browse/SPARK-10547
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Tests
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.6.0
>
>
> I've wanted to touch up the style of the Java API tests. We've fixed some 
> issues recently, but there are still some common issues in the code:
> - Unneeded generic types
> - Unneeded exception declarations
> - Unnecessary local vars
> - Assert args in the wrong order
> It's not a big issue, but a PR is coming...






[jira] [Updated] (SPARK-10518) Update code examples in spark.ml user guide to use LIBSVM data source instead of MLUtils

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10518:
--
Assignee: shimizu yoshihiro

> Update code examples in spark.ml user guide to use LIBSVM data source instead 
> of MLUtils
> 
>
> Key: SPARK-10518
> URL: https://issues.apache.org/jira/browse/SPARK-10518
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: shimizu yoshihiro
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>
> SPARK-10117 was merged; we should use the LIBSVM data source in the example code 
> in the spark.ml user guide, e.g.,
> {code}
> val df = sqlContext.read.format("libsvm").load("path")
> {code}
> instead of
> {code}
> val df = MLUtils.loadLibSVMFile(sc, "path").toDF()
> {code}
> We should update the following:
> {code}
> ml-ensembles.md:40:val data = MLUtils.loadLibSVMFile(sc,
> ml-ensembles.md:87:RDD data = MLUtils.loadLibSVMFile(jsc.sc(),
> ml-features.md:866:val data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> ml-features.md:892:JavaRDD rdd = MLUtils.loadLibSVMFile(sc.sc(),
> ml-features.md:917:data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> ml-features.md:940:val data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-features.md:964:  MLUtils.loadLibSVMFile(jsc.sc(), 
> "data/mllib/sample_libsvm_data.txt").toJavaRDD();
> ml-features.md:985:data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-features.md:1022:val data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-features.md:1047:  MLUtils.loadLibSVMFile(jsc.sc(), 
> "data/mllib/sample_libsvm_data.txt").toJavaRDD();
> ml-features.md:1068:data = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt")
> ml-linear-methods.md:44:val training = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> ml-linear-methods.md:84:DataFrame training = 
> sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), 
> LabeledPoint.class);
> ml-linear-methods.md:110:training = MLUtils.loadLibSVMFile(sc, 
> "data/mllib/sample_libsvm_data.txt").toDF()
> {code}






[jira] [Created] (SPARK-10576) Move .java files out of src/main/scala

2015-09-12 Thread Sean Owen (JIRA)
Sean Owen created SPARK-10576:
-

 Summary: Move .java files out of src/main/scala
 Key: SPARK-10576
 URL: https://issues.apache.org/jira/browse/SPARK-10576
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.5.0
Reporter: Sean Owen
Priority: Minor


(I suppose I'm really asking for an opinion on this, rather than asserting it 
must be done, but seems worthwhile. CC [~rxin] and [~pwendell])

As pointed out on the mailing list, there are some Java files in the Scala 
source tree:

{code}
./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
./core/src/main/scala/org/apache/spark/annotation/Experimental.java
./core/src/main/scala/org/apache/spark/annotation/package-info.java
./core/src/main/scala/org/apache/spark/annotation/Private.java
./core/src/main/scala/org/apache/spark/api/java/package-info.java
./core/src/main/scala/org/apache/spark/broadcast/package-info.java
./core/src/main/scala/org/apache/spark/executor/package-info.java
./core/src/main/scala/org/apache/spark/io/package-info.java
./core/src/main/scala/org/apache/spark/rdd/package-info.java
./core/src/main/scala/org/apache/spark/scheduler/package-info.java
./core/src/main/scala/org/apache/spark/serializer/package-info.java
./core/src/main/scala/org/apache/spark/util/package-info.java
./core/src/main/scala/org/apache/spark/util/random/package-info.java
./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
./mllib/src/main/scala/org/apache/spark/ml/package-info.java
./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
{code}

It happens to work since the Scala compiler plugin is handling both.

On its face, they should be in the Java source tree. I'm trying to figure out 
if there are good reasons they have to be in this less intuitive location.

I might try moving them just to see.






[jira] [Commented] (SPARK-10538) java.lang.NegativeArraySizeException during join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742018#comment-14742018
 ] 

Maciej Bryński commented on SPARK-10538:


OK.

I managed to isolate the problem.

I have two DataFrames:
1) a data DataFrame
2) a dictionary DataFrame

The counts of data, grouped by the foreign key into the dictionary, are the following:
key, count
1, 5398567
2, 59912
3, 3678
4, 74461
5, 12845

When I do the join, the result is partitioned by the join key, so one of the partitions 
is too big to process.

Is there any possibility to force a broadcast join from PySpark (or Spark SQL)?
I found this, but it's only for Scala: 
https://github.com/apache/spark/pull/6751/files
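
For what it's worth, a minimal Scala sketch of the DataFrame-level broadcast hint (the question is about PySpark, but this shows the idea; the join column "key" is an assumption for illustration):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

def broadcastJoin(dataDF: DataFrame, dictDF: DataFrame): DataFrame =
  // mark the small dictionary side for broadcast so the join does not have to
  // shuffle the big side by the (skewed) join key
  dataDF.join(broadcast(dictDF), "key")
{code}

The automatic variant is governed by the spark.sql.autoBroadcastJoinThreshold setting.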



> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
> Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem when joining tables in PySpark (in my example, 20 of 
> them).
> I can observe that during the calculation of the first partition (on one of the 
> consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) 
> versus the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a few 
> GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}






[jira] [Commented] (SPARK-9610) Class and instance weighting for ML

2015-09-12 Thread Nickolay Yakushev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742064#comment-14742064
 ] 

Nickolay Yakushev commented on SPARK-9610:
--

Thanks for the reply.

> Class and instance weighting for ML
> ---
>
> Key: SPARK-9610
> URL: https://issues.apache.org/jira/browse/SPARK-9610
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This umbrella is for tracking tasks for adding support for label or instance 
> weights to ML algorithms.  These additions will help handle skewed or 
> imbalanced data, ensemble methods, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10576) Move .java files out of src/main/scala

2015-09-12 Thread Kiran Lonikar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742066#comment-14742066
 ] 

Kiran Lonikar edited comment on SPARK-10576 at 9/12/15 2:13 PM:


That's right; the intent is to find out whether there is any particular reason to
co-locate the Java and Scala files.

Additionally, there are instances of Scala files in the Java source tree. I did not
attempt to find them all, but here is one:
core/src/main/java/org/apache/spark/api/java/function/package.scala



was (Author: klonikar):
thats right, the intent is to find out if there is any particular reason to 
colocate the java and scala files.

Additionally, there are instances of scala files in java source tree. Did not 
make an attempt to find all, but here is one:
core/src/main/java/org/apache/spark/api/java/function/package.scala


> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10576) Move .java files out of src/main/scala

2015-09-12 Thread Kiran Lonikar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742066#comment-14742066
 ] 

Kiran Lonikar commented on SPARK-10576:
---

That's right; the intent is to find out whether there is any particular reason to
co-locate the Java and Scala files.

Additionally, there are instances of Scala files in the Java source tree. I did not
attempt to find them all, but here is one:
core/src/main/java/org/apache/spark/api/java/function/package.scala


> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-09-12 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742070#comment-14742070
 ] 

Yadong Qi commented on SPARK-9213:
--

[~rxin] I'm working on this and already have a pull request, as you've seen.

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String, 
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8 
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
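
To make reason 2 concrete, here is a minimal sketch of the decode/regex/encode round trip that UTF8-encoded data currently has to go through (the input bytes and pattern are illustrative; this is not the actual Catalyst code path):

{code}
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

// UTF8 bytes must be decoded into a java.lang.String before
// java.util.regex can be applied; results are re-encoded afterwards.
val utf8Bytes: Array[Byte] = "some input".getBytes(StandardCharsets.UTF_8)
val asString  = new String(utf8Bytes, StandardCharsets.UTF_8)      // decode
val matched   = Pattern.compile("in.*t").matcher(asString).find()  // regex on the String
val backBytes = asString.getBytes(StandardCharsets.UTF_8)          // re-encode
{code}

A byte-oriented engine such as joni would, in principle, operate directly on utf8Bytes and skip both conversions.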



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9213) Improve regular expression performance (via joni)

2015-09-12 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742070#comment-14742070
 ] 

Yadong Qi edited comment on SPARK-9213 at 9/12/15 2:29 PM:
---

[~rxin] I'm working on this and have already created a pull request, as you've seen.


was (Author: waterman):
[~rxin] I'm working on this, and already have a pull request as you seen.

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String, 
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8 
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9213) Improve regular expression performance (via joni)

2015-09-12 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742070#comment-14742070
 ] 

Yadong Qi edited comment on SPARK-9213 at 9/12/15 2:41 PM:
---

[~rxin] I'm working on this and have already created a pull request, as you've seen.
Others are welcome to review my code and fix bugs together.


was (Author: waterman):
[~rxin] I'm working on this, and already created a pull request as you seen.

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String, 
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8 
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9213) Improve regular expression performance (via joni)

2015-09-12 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742070#comment-14742070
 ] 

Yadong Qi edited comment on SPARK-9213 at 9/12/15 2:42 PM:
---

[~rxin] I'm working on this and have already created a pull request, as you've seen.
Others are welcome to review my code and help make it better.


was (Author: waterman):
[~rxin] I'm working on this, and already created a pull request as you seen. 
Other people can review my code and fix the bug together.

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String, 
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8 
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742036#comment-14742036
 ] 

Maciej Bryński commented on SPARK-10577:


[~rxin] 
I can find broadcast in functions.scala.

Is it possible to use it in SQL?
select * from t1 join broadcast(t2) on t1.k1 = t2.k2 doesn't work.
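
A possible workaround at the SQL level (a sketch, assuming Spark can estimate the size of t2 and that it fits under the chosen threshold) is to raise spark.sql.autoBroadcastJoinThreshold so the planner chooses a broadcast join on its own:

{code}
// Sketch only; 104857600 bytes (100 MB) is an arbitrary example threshold.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "104857600")

// With the threshold raised, a plain join should be planned as a
// broadcast join when t2's estimated size is below the threshold.
val result = sqlContext.sql("SELECT * FROM t1 JOIN t2 ON t1.k1 = t2.k2")
{code}

Whether the planner actually applies it depends on having a size estimate for t2 (e.g. Hive table statistics); for temporary tables registered from DataFrames the estimate may not be available.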



> [PySpark, SQL] DataFrame hint for broadcast join
> 
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - Spark SQL
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742036#comment-14742036
 ] 

Maciej Bryński edited comment on SPARK-10577 at 9/12/15 12:29 PM:
--

[~rxin] 
I can find broadcast in functions.scala.

Is it possible to use it in SQL?
select * from t1 join broadcast(t2) on t1.k1 = t2.k2 doesn't work.

{code}
Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.AnalysisException: missing EOF at '(' near 'broadcast'; 
line 1 pos 41
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:296)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
{code}


was (Author: maver1ck):
[~rxin] 
I can find broadcast in functions.scala.

Is it possible to use it in SQL ?
select * from t1 join broadcast(t2) on t1.k1 = t2.k2 doesn't work.



> [PySpark, SQL] DataFrame hint for broadcast join
> 
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - Spark SQL
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742042#comment-14742042
 ] 

Maciej Bryński edited comment on SPARK-10577 at 9/12/15 12:47 PM:
--

Same without Hive support.
{code}
Py4JJavaError: An error occurred while calling o30.sql.
: java.lang.RuntimeException: [1.42] failure: ``union'' expected but `(' found
select * from t1 join broadcast(t2) on t1.k1 = t2.k2
   ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)

{code}


was (Author: maver1ck):
Same without Hive support.
Py4JJavaError: An error occurred while calling o30.sql.
{code}
: java.lang.RuntimeException: [1.42] failure: ``union'' expected but `(' found
select * from t1 join broadcast(t2) on t1.k1 = t2.k2
   ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)

{code}

> [PySpark, SQL] DataFrame hint for broadcast join
> 
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - Spark SQL
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10222) More thoroughly deprecate Bagel in favor of GraphX

2015-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10222:


Assignee: Apache Spark  (was: Sean Owen)

> More thoroughly deprecate Bagel in favor of GraphX
> --
>
> Key: SPARK-10222
> URL: https://issues.apache.org/jira/browse/SPARK-10222
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> It seems like Bagel has had little or no activity since before even Spark 1.0 
> (?) and is supposed to be superseded by GraphX. 
> Would it be reasonable to deprecate it for 1.6? and remove it in Spark 2.x? I 
> think it's reasonable enough that I'll assert this as a JIRA, but obviously 
> open to discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6350) Make mesosExecutorCores configurable in mesos "fine-grained" mode

2015-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742049#comment-14742049
 ] 

Apache Spark commented on SPARK-6350:
-

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/8732

> Make mesosExecutorCores configurable in mesos "fine-grained" mode
> -
>
> Key: SPARK-6350
> URL: https://issues.apache.org/jira/browse/SPARK-6350
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0, 1.5.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.4.0, 1.6.0
>
>
> When Spark runs in Mesos fine-grained mode, the Mesos slave launches the executor
> with a given number of CPUs and amount of memory. However, the number of executor
> cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value to 5
> for an intensive task, the Mesos executor always consumes 5 cores even when no task
> is running. This wastes resources. We should make the executor cores a
> configuration variable.
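
A sketch of how the resulting knob would be used (the property name spark.mesos.mesosExecutorCores follows this issue's fix; the value is illustrative):

{code}
import org.apache.spark.SparkConf

// Cap the fine-grained Mesos executor's own CPU share independently of
// spark.task.cpus (property name taken from this issue's fix; value illustrative).
val conf = new SparkConf()
  .setAppName("fine-grained-example")
  .set("spark.mesos.mesosExecutorCores", "0.25")
{code}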



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-12 Thread JIRA
Maciej Bryński created SPARK-10577:
--

 Summary: [PySpark, SQL] DataFrame hint for broadcast join
 Key: SPARK-10577
 URL: https://issues.apache.org/jira/browse/SPARK-10577
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.5.0
Reporter: Maciej Bryński


As in https://issues.apache.org/jira/browse/SPARK-8300
there should be a possibility to add a hint for broadcast join in:
- Spark SQL
- PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10538) java.lang.NegativeArraySizeException during join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742018#comment-14742018
 ] 

Maciej Bryński edited comment on SPARK-10538 at 9/12/15 12:24 PM:
--

OK.

I managed to isolate the problem.

I have two dataframes:
1) a data dataframe
2) a dictionary dataframe

The counts of data rows grouped by the foreign key to the dictionary are as follows:
key, count
1, 5398567
2, 59912
3, 3678
4, 74461
5, 12845
When I do the join, the result is partitioned by the join key, so one of the
partitions is too big to process.

As a workaround:
Is there any way to force a broadcast join from PySpark (or Spark SQL)?
I found this, but it's only for Scala:
https://github.com/apache/spark/pull/6751/files




was (Author: maver1ck):
OK.

I managed to isolate the problem.

I have two dataframes:
1) Data dataframe
2) Dictionary dataframe

Counts of data group by foreign key to dictionary are following:
key, count
1, 5398567
2, 59912
3, 3678
4, 74461
5, 12845
When I did a join - result is partitioned by join key, so one of the partitions 
is too big to process.

Is there any possibility to force broadcast join from pyspark (or spark sql)?
I found this, but it's only for Scala. 
https://github.com/apache/spark/pull/6751/files



> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
> Attachments: screenshot-1.png
>
>
> Hi,
> I've got a problem when joining tables in PySpark (20 of them in my example).
> I can observe that during the calculation of the first partition (on one of the
> consecutive joins) there is a large shuffle read size (294.7 MB / 146 records)
> versus the other partitions (approx. 272.5 KB / 113 records).
> I can also observe that just before the crash the Python process grows to a few
> GB of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on a 2-node cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742042#comment-14742042
 ] 

Maciej Bryński edited comment on SPARK-10577 at 9/12/15 12:46 PM:
--

Same without Hive support.
Py4JJavaError: An error occurred while calling o30.sql.
{code}
: java.lang.RuntimeException: [1.42] failure: ``union'' expected but `(' found
select * from t1 join broadcast(t2) on t1.k1 = t2.k2
   ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)

{code}


was (Author: maver1ck):
Same without Hive support.
Py4JJavaError: An error occurred while calling o30.sql.
{code}
: java.lang.RuntimeException: [1.42] failure: ``union'' expected but `(' found
select * from t1 join broadcast(t2) on t1.k1 = t2.k2
   ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)

{code}

> [PySpark, SQL] DataFrame hint for broadcast join
> 
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - Spark SQL
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742042#comment-14742042
 ] 

Maciej Bryński edited comment on SPARK-10577 at 9/12/15 12:46 PM:
--

Same without Hive support.
Py4JJavaError: An error occurred while calling o30.sql.
{code}
: java.lang.RuntimeException: [1.42] failure: ``union'' expected but `(' found
select * from t1 join broadcast(t2) on t1.k1 = t2.k2
   ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)

{code}


was (Author: maver1ck):
Same without Hive support.
Py4JJavaError: An error occurred while calling o30.sql.
{code}
: java.lang.RuntimeException: [1.42] failure: ``union'' expected but `(' found
select * from t1 join broadcast(t2) on t1.k1 = t2.k2
^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)

{code}

> [PySpark, SQL] DataFrame hint for broadcast join
> 
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - Spark SQL
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10222) More thoroughly deprecate Bagel in favor of GraphX

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10222:
--
Assignee: Sean Owen
Target Version/s: 1.6.0  (was: 2+)
Priority: Minor  (was: Major)
 Summary: More thoroughly deprecate Bagel in favor of GraphX  (was: 
Deprecate, retire Bagel in favor of GraphX)

> More thoroughly deprecate Bagel in favor of GraphX
> --
>
> Key: SPARK-10222
> URL: https://issues.apache.org/jira/browse/SPARK-10222
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> It seems like Bagel has had little or no activity since before even Spark 1.0 
> (?) and is supposed to be superseded by GraphX. 
> Would it be reasonable to deprecate it for 1.6? and remove it in Spark 2.x? I 
> think it's reasonable enough that I'll assert this as a JIRA, but obviously 
> open to discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742042#comment-14742042
 ] 

Maciej Bryński commented on SPARK-10577:


Same without Hive support.
Py4JJavaError: An error occurred while calling o30.sql.
{code}
: java.lang.RuntimeException: [1.42] failure: ``union'' expected but `(' found
select * from t1 join broadcast(t2) on t1.k1 = t2.k2
^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)

{code}

> [PySpark, SQL] DataFrame hint for broadcast join
> 
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - Spark SQL
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10222) More thoroughly deprecate Bagel in favor of GraphX

2015-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742043#comment-14742043
 ] 

Apache Spark commented on SPARK-10222:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8731

> More thoroughly deprecate Bagel in favor of GraphX
> --
>
> Key: SPARK-10222
> URL: https://issues.apache.org/jira/browse/SPARK-10222
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> It seems like Bagel has had little or no activity since before even Spark 1.0 
> (?) and is supposed to be superseded by GraphX. 
> Would it be reasonable to deprecate it for 1.6? and remove it in Spark 2.x? I 
> think it's reasonable enough that I'll assert this as a JIRA, but obviously 
> open to discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10222) More thoroughly deprecate Bagel in favor of GraphX

2015-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10222:


Assignee: Sean Owen  (was: Apache Spark)

> More thoroughly deprecate Bagel in favor of GraphX
> --
>
> Key: SPARK-10222
> URL: https://issues.apache.org/jira/browse/SPARK-10222
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> It seems like Bagel has had little or no activity since before even Spark 1.0 
> (?) and is supposed to be superseded by GraphX. 
> Would it be reasonable to deprecate it for 1.6? and remove it in Spark 2.x? I 
> think it's reasonable enough that I'll assert this as a JIRA, but obviously 
> open to discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9014) Allow Python spark API to use built-in exponential operator

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9014:
-
Assignee: Alexey Grishchenko

> Allow Python spark API to use built-in exponential operator
> ---
>
> Key: SPARK-9014
> URL: https://issues.apache.org/jira/browse/SPARK-9014
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Jon Speiser
>Assignee: Alexey Grishchenko
>Priority: Minor
> Fix For: 1.6.0
>
>
> It would be nice if instead of saying:
> import pyspark.sql.functions as funcs
> df = df.withColumn("standarderror", funcs.sqrt(df["variance"]))
> ...if I could simply say:
> df = df.withColumn("standarderror", df["variance"] ** 0.5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7942) Receiver's life cycle is inconsistent with streaming job.

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7942:
-
Assignee: Tathagata Das

> Receiver's life cycle is inconsistent with streaming job.
> -
>
> Key: SPARK-7942
> URL: https://issues.apache.org/jira/browse/SPARK-7942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: Tathagata Das
> Fix For: 1.5.0
>
>
> Streaming treats the receiver as an ordinary Spark job, so if an error
> occurs in the receiver's logic (after 4 retries by default), streaming
> will no longer get any data, but the streaming job keeps running.
> A typical scenario: we set
> `spark.streaming.receiver.writeAheadLog.enable` to true in order to use the
> `ReliableKafkaReceiver` but do not set the checkpoint dir. Then the receiver
> will soon be shut down but the streaming is alive.
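
For the scenario above, a minimal sketch of the setup that avoids the premature receiver shutdown, i.e. enabling the write-ahead log together with a checkpoint directory (the app name, batch interval, and HDFS path are placeholders):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-receiver-example")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
// The write-ahead log needs somewhere to write; without a checkpoint
// directory the receiver fails after its retries while the job keeps running.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")
{code}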



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10572) Investigate the contentions between tasks in the same executor

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10572:
--
Component/s: Spark Core
 Scheduler

> Investigate the contentions between tasks in the same executor
> --
>
> Key: SPARK-10572
> URL: https://issues.apache.org/jira/browse/SPARK-10572
> Project: Spark
>  Issue Type: Task
>  Components: Scheduler, Spark Core
>Reporter: Davies Liu
>
> According to the benchmark results from Jesse F Chen, it's surprising to see
> so much difference (4X) depending on the number of executors; we should
> investigate the reason.
> ```
> > Just be curious how the difference would be if you use 20 executors
> > and 20G memory for each executor..
> So I tried the following combinations:
> (GB X # executors)  (query response time in secs)
> 20X20 415
> 10X40 230
> 5X80  141
> 4X100 128
> 2X200 104
> CPU utilization is high, so spreading more JVMs onto more vCores helps in this
> case.
> For other workloads where memory utilization outweighs CPU, I can see larger
> JVM sizes being more beneficial. It's case-by-case for sure.
> The codegen and scheduler overheads seem negligible.
> ```
> https://www.mail-archive.com/user@spark.apache.org/msg36486.html
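
For context, a sketch of how one of the combinations above (2 GB x 200 executors) would be expressed in configuration; the exact knobs depend on the cluster manager, and spark.executor.instances is the YARN-style setting:

{code}
import org.apache.spark.SparkConf

// Sketch of the "2X200" row from the table above (values illustrative).
val conf = new SparkConf()
  .set("spark.executor.memory", "2g")
  .set("spark.executor.instances", "200")   // YARN-style executor count
  .set("spark.executor.cores", "1")
{code}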



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4879:
-
Target Version/s: 1.3.0  (was: 1.0.3, 1.1.2, 1.2.2, 1.3.0)
  Labels:   (was: backport-needed)

I'm clearing "backport-needed" since it's virtually certain that there will be 
no more 1.2.x or earlier releases, and so the fix that was committed won't go 
back further at this point.

Is this something to leave open pending the ongoing conversation here? It sounds
like there may be more to the fix.

> Missing output partitions after job completes with speculative execution
> 
>
> Key: SPARK-4879
> URL: https://issues.apache.org/jira/browse/SPARK-4879
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
> Attachments: speculation.txt, speculation2.txt
>
>
> When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
> save output files may report that they have completed successfully even 
> though some output partitions written by speculative tasks may be missing.
> h3. Reproduction
> This symptom was reported to me by a Spark user and I've been doing my own 
> investigation to try to come up with an in-house reproduction.
> I'm still working on a reliable local reproduction for this issue, which is a 
> little tricky because Spark won't schedule speculated tasks on the same host 
> as the original task, so you need an actual (or containerized) multi-host 
> cluster to test speculation.  Here's a simple reproduction of some of the 
> symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
> spark.speculation=true}}:
> {code}
> // Rig a job such that all but one of the tasks complete instantly
> // and one task runs for 20 seconds on its first attempt and instantly
> // on its second attempt:
> val numTasks = 100
> sc.parallelize(1 to numTasks, 
> numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
>   if (ctx.partitionId == 0) {  // If this is the one task that should run 
> really slow
> if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
>  Thread.sleep(20 * 1000)
> }
>   }
>   iter
> }.map(x => (x, x)).saveAsTextFile("/test4")
> {code}
> When I run this, I end up with a job that completes quickly (due to 
> speculation) but reports failures from the speculated task:
> {code}
> [...]
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
> 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
> (100/100)
> 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
> :22) finished in 0.856 s
> 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
> :22, took 0.885438374 s
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
> for 70.1 in stage 3.0 because task 70 has already completed successfully
> scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
> stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
> java.io.IOException: Failed to save output of task: 
> attempt_201412110141_0003_m_49_413
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> One interesting thing to note about this stack trace: if we look at 
> {{FileOutputCommitter.java:160}} 
> ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]),
>  this point in the execution seems to correspond to a case where a task 
> completes,