[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
> Fix For: 2.0.2, 2.1.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.
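
A minimal sketch of one possible workaround (not taken from the ticket): run the
concurrent actions on a plain fixed thread pool instead of the global ForkJoinPool, so a
blocked thread never picks up another task and its local properties stay isolated.
{{sc}} and {{toDF}} are assumed to be available as in the spark-shell snippet above.
{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// A dedicated pool: blocked threads simply wait instead of stealing other tasks.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val jobs = (1 to 100).map { _ =>
  Future {
    sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
  }
}
jobs.foreach(f => println(Await.result(f, Duration.Inf)))
{code}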






[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734585#comment-15734585
 ] 

Apache Spark commented on SPARK-13747:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16230

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.






[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.






[jira] [Reopened] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-12-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-13747:
--

This issue still exists. Reopened it.

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.






[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x

2016-12-08 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734560#comment-15734560
 ] 

Michael Allman commented on SPARK-18676:


Ah okay. That might be a strategy to explore.

> Spark 2.x query plan data size estimation can crash join queries versus 1.x
> ---
>
> Key: SPARK-18676
> URL: https://issues.apache.org/jira/browse/SPARK-18676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Michael Allman
>
> Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly 
> modified the way Spark SQL estimates the output data size of query plans. 
> I've found that—with the new table query partition pruning support in 
> 2.1—this has, in some cases, led to underestimation of join plan child size 
> statistics to a degree that makes executing such queries impossible without 
> disabling automatic broadcast conversion.
> In one case we debugged, the query planner had estimated the size of a join 
> child to be 3,854 bytes. In the execution of this child query, Spark reads 20 
> million rows in 1 GB of data from parquet files and shuffles 722.9 MB of 
> data, outputting 17 million rows. In planning the original join query, Spark 
> converts the child to a {{BroadcastExchange}}. This query execution fails 
> unless automatic broadcast conversion is disabled.
> This particular query is complex and very specific to our data and schema. I 
> have not yet developed a reproducible test case that can be shared. I realize 
> this ticket does not give the Spark team a lot to work with to reproduce and 
> test this issue, but I'm available to help. At the moment I can suggest 
> running a join where one side is an aggregation selecting a few fields over a 
> large table with a wide schema including many string columns.
> This issue exists in Spark 2.0, but we never encountered it because in that 
> version it only manifests itself for partitioned relations read from the 
> filesystem, and we rarely use this feature. We've encountered this issue in 
> 2.1 because 2.1 does partition pruning for metastore tables now.
> As a backstop, we've patched our branch of Spark 2.1 to revert the 
> reductions in default data type size for string, binary and user-defined 
> types. We also removed the override of the statistics method in {{UnaryNode}} 
> which reduces the output size of a plan based on the ratio of that plan's 
> output schema size versus its children's. We have not had this problem since.
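
A minimal sketch of the workaround mentioned above (assuming the standard
spark.sql.autoBroadcastJoinThreshold setting and an existing SparkSession named spark):
{code}
// Setting the threshold to -1 disables automatic broadcast conversion,
// so an underestimated join child is no longer turned into a BroadcastExchange.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
{code}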






[jira] [Assigned] (SPARK-17076) Cardinality estimation of join operator

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17076:


Assignee: Apache Spark

> Cardinality estimation of join operator
> ---
>
> Key: SPARK-17076
> URL: https://issues.apache.org/jira/browse/SPARK-17076
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Apache Spark
>
> support cardinality estimates for equi-join, Cartesian product join, and 
> outer join, etc. 
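
A rough sketch of the standard textbook estimate such a sub-task typically builds on
(an assumption for illustration, not taken from this ticket):
{code}
// A naive estimator sketch; names and simplifications are illustrative only.
// Classic equi-join estimate: |A| * |B| / max(ndv(A.k), ndv(B.k))
def equiJoinCardinality(rowsA: BigInt, rowsB: BigInt, ndvA: BigInt, ndvB: BigInt): BigInt =
  rowsA * rowsB / (ndvA max ndvB)

// Cartesian product: every row of A pairs with every row of B.
def cartesianCardinality(rowsA: BigInt, rowsB: BigInt): BigInt =
  rowsA * rowsB
{code}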






[jira] [Assigned] (SPARK-17076) Cardinality estimation of join operator

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17076:


Assignee: (was: Apache Spark)

> Cardinality estimation of join operator
> ---
>
> Key: SPARK-17076
> URL: https://issues.apache.org/jira/browse/SPARK-17076
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> support cardinality estimates for equi-join, Cartesian product join, and 
> outer join, etc. 






[jira] [Commented] (SPARK-17076) Cardinality estimation of join operator

2016-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734504#comment-15734504
 ] 

Apache Spark commented on SPARK-17076:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/16228

> Cardinality estimation of join operator
> ---
>
> Key: SPARK-17076
> URL: https://issues.apache.org/jira/browse/SPARK-17076
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> support cardinality estimates for equi-join, Cartesian product join, and 
> outer join, etc. 






[jira] [Comment Edited] (SPARK-10413) Model should support prediction on single instance

2016-12-08 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734495#comment-15734495
 ] 

Aseem Bansal edited comment on SPARK-10413 at 12/9/16 6:39 AM:
---

Hi

Is anyone working on this? And is there a JIRA ticket for having a predict 
method on PipelineModel?


was (Author: anshbansal):
Hi

Is anyone working on this?

> Model should support prediction on single instance
> --
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently, models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on a single instance.
> UPDATE: This issue is for making predictions with single models.  We can make 
> methods like {{def predict(features: Vector): Double}} public.
> * This issue is *not* for single-instance prediction for full Pipelines, 
> which would require making predictions on {{Row}}s.
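
A hedged sketch of the kind of single-instance call this umbrella would enable (the
public predict method shown here is hypothetical at the time of this ticket; model is
assumed to be a fitted regression or classification model):
{code}
import org.apache.spark.ml.linalg.Vectors

val features = Vectors.dense(0.1, 2.3, 4.5)
// Hypothetical once def predict(features: Vector): Double is made public:
val prediction: Double = model.predict(features)
{code}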






[jira] [Commented] (SPARK-10413) Model should support prediction on single instance

2016-12-08 Thread Aseem Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734495#comment-15734495
 ] 

Aseem Bansal commented on SPARK-10413:
--

Hi

Is anyone working on this?

> Model should support prediction on single instance
> --
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently, models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on a single instance.
> UPDATE: This issue is for making predictions with single models.  We can make 
> methods like {{def predict(features: Vector): Double}} public.
> * This issue is *not* for single-instance prediction for full Pipelines, 
> which would require making predictions on {{Row}}s.






[jira] [Resolved] (SPARK-18697) Upgrade sbt plugins

2016-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18697.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16223
[https://github.com/apache/spark/pull/16223]

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Trivial
> Fix For: 2.2.0
>
>
> For 2.2.x, it's better to bring the sbt plugins up to date. The following sbt 
> plugins will be upgraded:
> {code}
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
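
A minimal sketch of the corresponding project/plugins.sbt entries after the bump
(the plugin coordinates are assumed from the usual published artifacts, not from this ticket):
{code}
// sbt plugin declarations; versions match the upgrade list above.
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.12")
{code}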






[jira] [Updated] (SPARK-18349) Update R API documentation on ml model summary

2016-12-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18349:
-
Fix Version/s: 2.1.1

> Update R API documentation on ml model summary
> --
>
> Key: SPARK-18349
> URL: https://issues.apache.org/jira/browse/SPARK-18349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Miao Wang
> Fix For: 2.1.1
>
>
> It has been discovered that there is a fair bit of inconsistency in the 
> documentation of summary functions, e.g.
> {code}
> #' @return \code{summary} returns a summary object of the fitted model, a 
> list of components
> #' including formula, number of features, list of features, feature 
> importances, number of
> #' trees, and tree weights
> setMethod("summary", signature(object = "GBTRegressionModel")
> {code}
> For instance, what should be listed for the return value? Should it be a name 
> or a phrase, or should it be a list of items? And should there be a longer 
> description of what they mean, or a reference link to the Scala doc?
> We will need to review this for all model summary implementations in mllib.R






[jira] [Resolved] (SPARK-18349) Update R API documentation on ml model summary

2016-12-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18349.
--
  Resolution: Fixed
Assignee: Miao Wang
Target Version/s: 2.1.1

> Update R API documentation on ml model summary
> --
>
> Key: SPARK-18349
> URL: https://issues.apache.org/jira/browse/SPARK-18349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Miao Wang
>
> It has been discovered that there is a fair bit of inconsistency in the 
> documentation of summary functions, e.g.
> {code}
> #' @return \code{summary} returns a summary object of the fitted model, a 
> list of components
> #' including formula, number of features, list of features, feature 
> importances, number of
> #' trees, and tree weights
> setMethod("summary", signature(object = "GBTRegressionModel")
> {code}
> For instance, what should be listed for the return value? Should it be a name 
> or a phrase, or should it be a list of items? And should there be a longer 
> description of what they mean, or a reference link to the Scala doc?
> We will need to review this for all model summary implementations in mllib.R






[jira] [Commented] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-12-08 Thread Bravo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734426#comment-15734426
 ] 

Bravo Zhang commented on SPARK-14932:
-

[~nchammas] Can your use case be done by filter?
df.filter(df.name == "")

> Allow DataFrame.replace() to replace values with None
> -
>
> Key: SPARK-14932
> URL: https://issues.apache.org/jira/browse/SPARK-14932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: starter
>
> Current doc: 
> http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> I would like to specify {{None}} as the value to substitute in. This is 
> currently 
> [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
>  My use case is for replacing bad values with {{None}} so I can then ignore 
> them with {{dropna()}}.
> For example, I have a dataset that incorrectly includes empty strings where 
> there should be {{None}} values. I would like to replace the empty strings 
> with {{None}} and then drop all null data with {{dropna()}}.
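
A small sketch of the current workaround in the Scala DataFrame API (assuming a string
column named "name"; the ticket itself is about allowing this directly via
replace(..., None) in PySpark):
{code}
import org.apache.spark.sql.functions.{col, when}

// Null out empty strings, then drop rows containing nulls.
val cleaned = df
  .withColumn("name", when(col("name") === "", null).otherwise(col("name")))
  .na.drop()
{code}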






[jira] [Commented] (SPARK-18788) Add getNumPartitions() to SparkR

2016-12-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734403#comment-15734403
 ] 

Felix Cheung commented on SPARK-18788:
--

In SparkR we don't officially support the RDD API - in what form would you suggest 
getNumPartitions be exposed?

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> Would be really convenient to have getNumPartitions() in SparkR, which was in 
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)






[jira] [Assigned] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14932:


Assignee: (was: Apache Spark)

> Allow DataFrame.replace() to replace values with None
> -
>
> Key: SPARK-14932
> URL: https://issues.apache.org/jira/browse/SPARK-14932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: starter
>
> Current doc: 
> http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> I would like to specify {{None}} as the value to substitute in. This is 
> currently 
> [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
>  My use case is for replacing bad values with {{None}} so I can then ignore 
> them with {{dropna()}}.
> For example, I have a dataset that incorrectly includes empty strings where 
> there should be {{None}} values. I would like to replace the empty strings 
> with {{None}} and then drop all null data with {{dropna()}}.






[jira] [Commented] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734376#comment-15734376
 ] 

Apache Spark commented on SPARK-14932:
--

User 'bravo-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16225

> Allow DataFrame.replace() to replace values with None
> -
>
> Key: SPARK-14932
> URL: https://issues.apache.org/jira/browse/SPARK-14932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: starter
>
> Current doc: 
> http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> I would like to specify {{None}} as the value to substitute in. This is 
> currently 
> [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
>  My use case is for replacing bad values with {{None}} so I can then ignore 
> them with {{dropna()}}.
> For example, I have a dataset that incorrectly includes empty strings where 
> there should be {{None}} values. I would like to replace the empty strings 
> with {{None}} and then drop all null data with {{dropna()}}.






[jira] [Assigned] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14932:


Assignee: Apache Spark

> Allow DataFrame.replace() to replace values with None
> -
>
> Key: SPARK-14932
> URL: https://issues.apache.org/jira/browse/SPARK-14932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Current doc: 
> http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> I would like to specify {{None}} as the value to substitute in. This is 
> currently 
> [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
>  My use case is for replacing bad values with {{None}} so I can then ignore 
> them with {{dropna()}}.
> For example, I have a dataset that incorrectly includes empty strings where 
> there should be {{None}} values. I would like to replace the empty strings 
> with {{None}} and then drop all null data with {{dropna()}}.






[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-08 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734318#comment-15734318
 ] 

Takeshi Yamamuro commented on SPARK-18699:
--

Yea, I'm also working with large CSV files now and, certainly, I think this 
current behavior makes it difficult to find incorrect records in them. Logging 
meaningful warning messages (e.g., including the incorrect records and line 
numbers) would help me a lot.

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If a CSV is read and the schema contains any type other than String, an exception 
> is thrown when the string value in the CSV is malformed; e.g. if the timestamp 
> does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding, the PERMISSIVE and DROPMALFORMED modes should just null the 
> value or drop the line, respectively, but instead they kill the job.
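
A short sketch of the intended read path (schema, path, and column names are assumed
for illustration):
{code}
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", IntegerType)
  .add("created", TimestampType)

// With DROPMALFORMED the expectation is that unparsable rows are dropped;
// with PERMISSIVE the bad values should be nulled instead of failing the job.
val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .csv("/path/to/data.csv")
{code}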






[jira] [Comment Edited] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-08 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718580#comment-15718580
 ] 

Li Yuanjian edited comment on SPARK-18700 at 12/9/16 4:47 AM:
--

Opened a PR for this: add a StripedLock for each table's relation in the cache, not for 
the whole cachedDataSourceTables.



was (Author: xuanyuan):
I'll add a PR for this soon: add a ReadWriteLock for each table's relation in the 
cache, not for the whole cachedDataSourceTables.


> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> --
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Li Yuanjian
>
> In our Spark SQL platform, each query uses the same HiveContext and an 
> independent thread, and new data is appended to tables as new partitions every 
> 30 min. After a new partition is added to table T, we should call refreshTable to 
> clear T's cache in cachedDataSourceTables to make the new partition 
> searchable. 
> For a table with many partitions and files (much more than 
> spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table 
> T will start a job to fetch all FileStatus objects in the listLeafFiles function. Because 
> of the huge number of files, the job runs for several seconds; during that 
> time, new queries of table T will also start new jobs to fetch FileStatus 
> because getCached is not thread safe. This finally causes a driver 
> OOM.
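
A rough sketch of the per-table striped-lock idea mentioned in the comment above
(Guava's Striped and all names here are assumptions for illustration, not the actual PR):
{code}
import com.google.common.util.concurrent.Striped
import scala.collection.concurrent.TrieMap

val tableLocks = Striped.lock(64)                 // locks are striped by table name
val cachedTables = TrieMap.empty[String, AnyRef]  // stand-in for cachedDataSourceTables

def getOrBuild(tableName: String)(build: => AnyRef): AnyRef = {
  val lock = tableLocks.get(tableName)
  lock.lock()
  try {
    // Only one thread per table lists files and builds the relation;
    // queries against other tables are not blocked by this one.
    cachedTables.getOrElseUpdate(tableName, build)
  } finally {
    lock.unlock()
  }
}
{code}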






[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-12-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734207#comment-15734207
 ] 

Saisai Shao commented on SPARK-13955:
-

Can you please check the runtime environment of the launched container? It should 
lie in the NM's local dir. 

{code}
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}/container_${contid}
{code}

When the NM brings up a container, it will create a container-specific folder and put 
all the dependencies, files, etc. into that folder, including the launch 
script. You could check whether the classpath is correct, or whether the archive is 
found there.

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue from the work on 
> SPARK-11157; I created this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}






[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-12-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734195#comment-15734195
 ] 

liyunzhang_intel commented on SPARK-13955:
--

[~jerryshao]: yes, the archive contains the spark-yarn_2.11 jar

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue from the work on 
> SPARK-11157; I created this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}






[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-12-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734184#comment-15734184
 ] 

Saisai Shao commented on SPARK-13955:
-

Do you have spark-yarn_2.11 jar in your archive?

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue from the work on 
> SPARK-11157; I created this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}






[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext

2016-12-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734150#comment-15734150
 ] 

Dongjoon Hyun commented on SPARK-11374:
---

For this issue, there is a discussion going on in the PR now. It seems that we can make 
a decision now: YES (Resolved) or NO (Won't Fix).

> skip.header.line.count is ignored in HiveContext
> 
>
> Key: SPARK-11374
> URL: https://issues.apache.org/jira/browse/SPARK-11374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Daniel Haviv
>
> A CSV table in Hive is configured to skip the header row using 
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when 
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the 
> setting.






[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension

2016-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734130#comment-15734130
 ] 

Hyukjin Kwon commented on SPARK-18799:
--

Ah, there it is - https://github.com/apache/spark/pull/10801

> Spark SQL expose interface for plug-gable parser extension 
> ---
>
> Key: SPARK-18799
> URL: https://issues.apache.org/jira/browse/SPARK-18799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jihong MA
>
> There used to be an interface to plug in a parser extension through 
> ParserDialect in HiveContext in all Spark 1.x versions. Starting with the Spark 2.x 
> release, Apache Spark moved to the new parser (Antlr4), and there is no longer a 
> way to extend the default SQL parser through the SparkSession interface. However, 
> this is really a pain and hard to work around when integrating other data 
> sources with Spark with extended support such as Insert, Update, or Delete 
> statements or any other data management statement. 
> It would be very nice to continue to expose an interface for parser extension 
> to make data source integration easier and smoother. 






[jira] [Commented] (SPARK-18799) Spark SQL expose interface for plug-gable parser extension

2016-12-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734128#comment-15734128
 ] 

Hyukjin Kwon commented on SPARK-18799:
--

Roughly a year ago or less, while tracking down the code path around the SQL parser, I 
saw a PR saying that Spark does not support a pluggable parser and that the hook was 
removed to prevent the illusion of supporting it, *if I remember this 
correctly*. cc [~rxin] I hope my memory is not playing tricks on me.

> Spark SQL expose interface for plug-gable parser extension 
> ---
>
> Key: SPARK-18799
> URL: https://issues.apache.org/jira/browse/SPARK-18799
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jihong MA
>
> There used to be an interface to plug in a parser extension through 
> ParserDialect in HiveContext in all Spark 1.x versions. Starting with the Spark 2.x 
> release, Apache Spark moved to the new parser (Antlr4), and there is no longer a 
> way to extend the default SQL parser through the SparkSession interface. However, 
> this is really a pain and hard to work around when integrating other data 
> sources with Spark with extended support such as Insert, Update, or Delete 
> statements or any other data management statement. 
> It would be very nice to continue to expose an interface for parser extension 
> to make data source integration easier and smoother. 






[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-12-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734082#comment-15734082
 ] 

liyunzhang_intel commented on SPARK-13955:
--

Tested pi in yarn-client mode using "spark.yarn.archive".
[~jerryshao]: the following are the detailed steps when I use "spark.yarn.archive":
1. zip all jars:  zip spark-archive.zip $SPARK_HOME/jars/*
2. upload the zip to hdfs: hadoop fs -copyFromLocal spark-archive.zip 
hdfs://bdpe42:8020/
3. modify the spark-defaults.conf
   spark.yarn.archive=hdfs://bdpe42:8020/spark-archive.zip
4. run pi in yarn client mode
{code}
   ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-client --num-executors 3 --driver-memory 1g --executor-memory 1g   
  --executor-cores 1   $spark_example_jar > sparkPi.log 2>&1 
{code}

The exception in container log is
{code}
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
{code}

The spark version is  2.0.2.


> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue from the work on 
> SPARK-11157; I created this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}






[jira] [Comment Edited] (SPARK-13955) Spark in yarn mode fails

2016-12-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734082#comment-15734082
 ] 

liyunzhang_intel edited comment on SPARK-13955 at 12/9/16 2:40 AM:
---

[~jerryshao]: the following are the detailed steps when I use "spark.yarn.archive":
1. zip all jars:  zip spark-archive.zip $SPARK_HOME/jars/*
2. upload the zip to hdfs: hadoop fs -copyFromLocal spark-archive.zip 
hdfs://bdpe42:8020/
3. modify the spark-defaults.conf
   spark.yarn.archive=hdfs://bdpe42:8020/spark-archive.zip
4. run pi in yarn client mode
{code}
   ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-client --num-executors 3 --driver-memory 1g --executor-memory 1g   
  --executor-cores 1   $spark_example_jar > sparkPi.log 2>&1 
{code}

The exception in container log is
{code}
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
{code}

The spark version is  2.0.2.



was (Author: kellyzly):
Tested pi in yarn-client mode using "spark.yarn.archive".
[~jerryshao]: the following are the detailed steps when I use "spark.yarn.archive":
1. zip all jars:  zip spark-archive.zip $SPARK_HOME/jars/*
2. upload the zip to hdfs: hadoop fs -copyFromLocal spark-archive.zip 
hdfs://bdpe42:8020/
3. modify the spark-defaults.conf
   spark.yarn.archive=hdfs://bdpe42:8020/spark-archive.zip
4. run pi in yarn client mode
{code}
   ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-client --num-executors 3 --driver-memory 1g --executor-memory 1g   
  --executor-cores 1   $spark_example_jar > sparkPi.log 2>&1 
{code}

The exception in container log is
{code}
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher
{code}

The spark version is  2.0.2.


> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue from the work on 
> SPARK-11157; I created this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}






[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-12-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734071#comment-15734071
 ] 

Saisai Shao commented on SPARK-13955:
-

IIRC {{spark.yarn.archive}} should work; I tried it personally on my local 
machine, and our HDP distribution also configures it by default.

If you want to use "spark.yarn.archive", you should zip all the jars Spark requires at 
run time, put this archive either locally or on HDFS, and configure the 
path in "spark.yarn.archive". Then YARN will add it to the distributed cache.

Can you please tell us how you configured it and the error you met?

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark assembly 
> jar is not uploaded to HDFS. This may be a known issue from the work on 
> SPARK-11157; I created this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}






[jira] [Issue Comment Deleted] (SPARK-17076) Cardinality estimation of join operator

2016-12-08 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-17076:
---
Comment: was deleted

(was: Hi,

I am out of office 11 /15 through 11/18 with very limited Internet access.  My 
reply to your email will be delayed.  Thanks.

Best,
Ron Hu

)

> Cardinality estimation of join operator
> ---
>
> Key: SPARK-17076
> URL: https://issues.apache.org/jira/browse/SPARK-17076
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> support cardinality estimates for equi-join, Cartesian product join, and 
> outer join, etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734033#comment-15734033
 ] 

Reynold Xin commented on SPARK-18278:
-

In the past few days I've given this a lot of thought.

I'm personally very interested in this work, and would actually use it myself. 
That said, based on my experience, the real work starts after the initial thing 
works, i.e. the maintenance and enhancement work in the future will be much 
larger than the initial commit. Adding another officially supported scheduler 
definitely has some serious (and maybe disruptive) impacts to Spark. Some 
examples are ...

1. Testing becomes more complicated.
2. Related to 1, releases become more likely to be delayed. In the past, many 
Spark releases were delayed due to bugs in the Mesos or YARN integrations, 
because those are harder to test reliably in an automated fashion.
3. The release process has to change.

Given that Kubernetes is still very young, and it is unclear how successful it will be 
in the future (I personally think it will be, but you never know), I would make 
the following concrete recommendations on moving this forward:

1. See if we can implement this as an add-on (library) outside Spark. If that is 
not possible, what about a fork?
2. Publish some non-official docker images so it is easy to use Spark on 
Kubernetes this way.
3. Encourage users to use it and get feedback. Have the contributors that are 
really interested in this work maintain it for a couple of Spark releases (this 
includes testing the implementation, publishing new docker images, and writing 
documentation).
4. Evaluate later (say, after 2 releases) how well this has been received, and 
decide whether to make a coordinated effort to merge this into Spark, since it 
might become the most popular cluster manager.



> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18750) spark should be able to control the number of executors and should not throw stack overflow

2016-12-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734028#comment-15734028
 ] 

Sean Owen commented on SPARK-18750:
---

Yes, so the basic question is: where is the error coming from? Is it Spark? That 
may be a simplistic question, but I'm not sure where to look for a solution, and 
I have the impression the reporter knows.

> spark should be able to control the number of executors and should not throw 
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow 
> warning, and it shows it is requesting lots of executors.
> It looks like there is no limit to the number of executors, not even an upper 
> bound based on the available YARN resources.
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> 

[jira] [Resolved] (SPARK-18792) SparkR vignette update: logit

2016-12-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-18792.
---
Resolution: Duplicate

> SparkR vignette update: logit
> -
>
> Key: SPARK-18792
> URL: https://issues.apache.org/jira/browse/SPARK-18792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover logit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18792) SparkR vignette update: logit

2016-12-08 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734024#comment-15734024
 ] 

Xiangrui Meng commented on SPARK-18792:
---

[~wangmiao1981] Please check existing sub-tasks before creating new ones. I'm 
closing mine.

> SparkR vignette update: logit
> -
>
> Key: SPARK-18792
> URL: https://issues.apache.org/jira/browse/SPARK-18792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover logit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18792) SparkR vignette update: logit

2016-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734013#comment-15734013
 ] 

Apache Spark commented on SPARK-18792:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16224

> SparkR vignette update: logit
> -
>
> Key: SPARK-18792
> URL: https://issues.apache.org/jira/browse/SPARK-18792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover logit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18774) Ignore non-existing files when ignoreCorruptFiles is enabled

2016-12-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18774:

Fix Version/s: 2.1.1

> Ignore non-existing files when ignoreCorruptFiles is enabled
> 
>
> Key: SPARK-18774
> URL: https://issues.apache.org/jira/browse/SPARK-18774
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18799) Spark SQL expose interface for pluggable parser extension

2016-12-08 Thread Jihong MA (JIRA)
Jihong MA created SPARK-18799:
-

 Summary: Spark SQL expose interface for pluggable parser 
extension 
 Key: SPARK-18799
 URL: https://issues.apache.org/jira/browse/SPARK-18799
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Jihong MA


There used to be an interface to plug in a parser extension through ParserDialect 
in HiveContext in all Spark 1.x versions. Starting with the Spark 2.x release, Apache 
Spark moved to the new parser (ANTLR4), and there is no longer a way to extend the 
default SQL parser through the SparkSession interface. This is really a 
pain and hard to work around when integrating other data sources with Spark 
with extended support such as Insert, Update, Delete statements or any other 
data management statement. 

It would be very nice to continue to expose an interface for parser extension 
to make data source integration easier and smoother. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18776) Offset for FileStreamSource is not json formatted

2016-12-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18776.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16205
[https://github.com/apache/spark/pull/16205]

> Offset for FileStreamSource is not json formatted
> -
>
> Key: SPARK-18776
> URL: https://issues.apache.org/jira/browse/SPARK-18776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 2.1.0
>
>
> All source offsets must be JSON formatted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-12-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733992#comment-15733992
 ] 

Dongjoon Hyun commented on SPARK-18642:
---

I see. If then, I'll record that, too.

> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
> Environment: Ubuntu 14.04
> Spark: Local Mode
>Reporter: Mohit
>  Labels: performance
> Fix For: 2.0.0
>
>
> When doing a left-join between two tables, say A and B,  Catalyst has 
> information about the projection required for table B. Only the required 
> columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
> where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>   :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: 
> file:/home/mohit/ruleA
>   +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: 
> file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge 
> performance impact.
> External reference: 
> http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.

2016-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17689:
-
Target Version/s: 2.2.0
 Description: 
Steps to reproduce:

1) Start a streaming job which reads from HDFS location hdfs://xyz/*
2) Write content to hdfs://xyz/a
.
.
repeat a few times.

And then job breaks as follows.


org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in 
stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage 304.0 
(TID 14794, localhost): java.io.FileNotFoundException: File does not exist: 
hdfs://localhost:9000/input/t5/_temporary
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


  was:

Steps to reproduce:

1) Start a streaming job which reads from HDFS location hdfs://xyz/*
2) Write content to hdfs://xyz/a
.
.
repeat a few times.

And then job breaks as follows.


org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in 
stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage 304.0 
(TID 14794, localhost): java.io.FileNotFoundException: File does not exist: 
hdfs://localhost:9000/input/t5/_temporary
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464)
at 
org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 

[jira] [Updated] (SPARK-18272) Test topic addition for subscribePattern on Kafka DStream and Structured Stream

2016-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18272:
-
Issue Type: Test  (was: Bug)

> Test topic addition for subscribePattern on Kafka DStream and Structured 
> Stream
> ---
>
> Key: SPARK-18272
> URL: https://issues.apache.org/jira/browse/SPARK-18272
> Project: Spark
>  Issue Type: Test
>  Components: DStreams, Structured Streaming
>Reporter: Cody Koeninger
>
> We've had reports of the following sequence:
> - create a subscribePattern stream that doesn't match any existing topics at 
> the time the stream starts
> - add a topic that matches the pattern
> - expect that messages from that topic show up, but they don't
> We don't seem to actually have tests that cover this case, so we should add 
> them.
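
For illustration, a minimal sketch of the scenario described above using the 
Structured Streaming Kafka source; the bootstrap server, pattern, and sink 
choices are placeholders, not from this ticket:
{code}
// The pattern matches no existing topic when the stream starts.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribePattern", "topic-.*")
  .load()

val query = df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

// After the query is running, create a topic such as "topic-1" and publish to it;
// the expectation under test is that its messages show up in the output.
{code}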



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18790) Keep a general offset history of stream batches

2016-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18790:
-
Target Version/s: 2.1.0

> Keep a general offset history of stream batches
> ---
>
> Key: SPARK-18790
> URL: https://issues.apache.org/jira/browse/SPARK-18790
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tyson Condie
>
> Instead of only keeping the minimum number of offsets around, we should keep 
> enough information to allow us to roll back n batches and re-execute the 
> stream starting from a given point. In particular, we should create a config 
> in SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100 and 
> ensure that we keep enough log files in the following places to roll back the 
> specified number of batches:
> * the offsets that are present in each batch
> * versions of the state store
> * the file lists stored for the FileStreamSource
> * the metadata log stored by the FileStreamSink
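
A sketch only: the config name and default below are the proposed ones from the 
description and may change before this ships.
{code}
// Proposed (not yet released) config name; illustrative only.
spark.conf.set("spark.sql.streaming.retainedBatches", "100")
{code}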



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18796) StreamingQueryManager should not hold a lock when starting a query

2016-12-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18796:
-
Target Version/s: 2.1.0

> StreamingQueryManager should not hold a lock when starting a query
> --
>
> Key: SPARK-18796
> URL: https://issues.apache.org/jira/browse/SPARK-18796
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>
> Otherwise, the user cannot start any queries when a query is starting. If a 
> query takes a long time to start, the user experience will be pretty bad.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18787) spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer usage by Netty

2016-12-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733902#comment-15733902
 ] 

Sean Owen commented on SPARK-18787:
---

CC [~zsxwing] 

The tricky thing is that I think these classes may load before we know whether 
this config is enabled?

> spark.shuffle.io.preferDirectBufs does not completely turn off direct buffer 
> usage by Netty
> ---
>
> Key: SPARK-18787
> URL: https://issues.apache.org/jira/browse/SPARK-18787
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Aniket Bhatnagar
>
> The documentation for the configuration spark.shuffle.io.preferDirectBufs 
> states that it will force all allocations from Netty to be on-heap but this 
> currently does not happen. The reason is that preferDirect of Netty's 
> PooledByteBufAllocator doesn't completely eliminate use of off heap by Netty. 
> In order to completely stop netty from using off heap memory, we need to set 
> the following system properties:
> - io.netty.noUnsafe=true
> - io.netty.threadLocalDirectBufferSize=0
> The proposal is to set these properties (using System.setProperty) when the 
> executor starts (before any of the Netty classes load), or to document these 
> properties to show users how to completely eliminate Netty's off-heap 
> footprint.
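
For illustration, a minimal sketch of the workaround described above; it only 
has an effect if it runs before any Netty class is loaded in the JVM:
{code}
// Property names and values are taken from the description above; they must be
// set very early, e.g. at the start of the driver/executor JVM.
System.setProperty("io.netty.noUnsafe", "true")
System.setProperty("io.netty.threadLocalDirectBufferSize", "0")
{code}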



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18750) spark should be able to control the number of executors and should not throw stack overflow

2016-12-08 Thread Yibing Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733892#comment-15733892
 ] 

Yibing Shi edited comment on SPARK-18750 at 12/9/16 12:59 AM:
--

[~srowen] The log says:
{noformat}
16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 time(s) 
in a row.
java.lang.StackOverflowError
{noformat}

It looks like the number of executors is too high for the reporter thread to deal 
with.



was (Author: yibing):
[~srowen] The log says:
{noformat}
16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 time(s) 
in a row.
java.lang.StackOverflowError
{noformat}

The looks like that number of executors is too high for reporter thread to deal 
with.


> spark should be able to control the number of executors and should not throw 
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow 
> warning, and it shows it is requesting lots of executors.
> It looks like there is no limit to the number of executors, not even an upper 
> bound based on the available YARN resources.
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> 

[jira] [Commented] (SPARK-18750) spark should be able to control the number of executors and should not throw stack overflow

2016-12-08 Thread Yibing Shi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733892#comment-15733892
 ] 

Yibing Shi commented on SPARK-18750:


[~srowen] The log says:
{noformat}
16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 time(s) 
in a row.
java.lang.StackOverflowError
{noformat}

It looks like the number of executors is too high for the reporter thread to deal 
with.
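
Not part of the report, but as a hedged sketch of one way to put an upper bound 
on executor requests when dynamic allocation is in use (the cap value below is a 
placeholder):
{code}
import org.apache.spark.SparkConf

// Placeholder cap; choose a value that fits the cluster's available YARN resources.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service is needed for dynamic allocation
  .set("spark.dynamicAllocation.maxExecutors", "200")
{code}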


> spark should be able to control the number of executors and should not throw 
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow 
> warning, and it shows it is requesting lots of executors.
> It looks like there is no limit to the number of executors, not even an upper 
> bound based on the available YARN resources.
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> 

[jira] [Commented] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-12-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733886#comment-15733886
 ] 

Sean Owen commented on SPARK-18642:
---

[~dongjoon] ignore this if you don't know, but if you happen to know what issue 
fixed this, that would be great to record.

> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
> Environment: Ubuntu 14.04
> Spark: Local Mode
>Reporter: Mohit
>  Labels: performance
> Fix For: 2.0.0
>
>
> When doing a left-join between two tables, say A and B,  Catalyst has 
> information about the projection required for table B. Only the required 
> columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
> where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>   :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: 
> file:/home/mohit/ruleA
>   +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: 
> file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge 
> performance impact.
> External reference: 
> http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18718) Skip some test failures due to path length limitation and fix tests to pass on Windows

2016-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18718:
--
Assignee: Hyukjin Kwon

> Skip some test failures due to path length limitation and fix tests to pass 
> on Windows
> --
>
> Key: SPARK-18718
> URL: https://issues.apache.org/jira/browse/SPARK-18718
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> There are some tests failed on Windows due to the wrong format of path and 
> the limitation of path length as below:
> - {{InsertSuite}}
> {code}
>   Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.sql.sources.InsertSuite *** ABORTED *** (12 seconds, 547 
> milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget   
> mpspark-177945ef-9128-42b4-8c07-de31f78bbbd6;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> {code}
> - {{BroadcastJoinSuite}}
> {code}
>   04:09:40.882 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error 
> running executor
>   java.io.IOException: Cannot run program 
> "C:\Progra~1\Java\jdk1.8.0\bin\java" (in directory 
> "C:\projects\spark\work\app-20161205040542-\51658"): CreateProcess 
> error=206, The filename or extension is too long
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
>   Caused by: java.io.IOException: CreateProcess error=206, The filename or 
> extension is too long
>   at java.lang.ProcessImpl.create(Native Method)
>   at java.lang.ProcessImpl.(ProcessImpl.java:386)
>   at java.lang.ProcessImpl.start(ProcessImpl.java:137)
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   ... 2 more
>   04:09:40.929 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error 
> running executor
> (appearently infinite same error messages)
>   
>   ...
> {code}
> - {{PathOptionSuite}}
> {code}
>   - path option also exist for write path *** FAILED *** (1 second, 93 
> milliseconds)
> "C:[projectsspark arget   mp]spark-5ab34a58-df8d-..." did not equal 
> "C:[\projects\spark\target\tmp\]spark-5ab34a58-df8d-..." 
> (PathOptionSuite.scala:93)
> org.scalatest.exceptions.TestFailedException:
> at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
> at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> ...
> {code}
> - {{SparkLauncherSuite}}
> {code}
>   Test org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher 
> failed: java.lang.AssertionError: expected:<0> but was:<1>, took 0.062 sec
> at 
> org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
>   ...
> {code}
> - {{UDFSuite}}
> {code}
>   - SPARK-8005 input_file_name *** FAILED *** (2 seconds, 234 milliseconds)
> 
> "file:///C:/projects/spark/target/tmp/spark-e4e5720a-2006-48f9-8b11-797bf59794bf/part-1-26fb05e4-603d-471d-ae9d-b9549e0c7765.snappy.parquet"
>  did not contain 
> "C:\projects\spark\target\tmp\spark-e4e5720a-2006-48f9-8b11-797bf59794bf" 
> (UDFSuite.scala:67)
> org.scalatest.exceptions.TestFailedException:
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
> ...
> {code}
> This JIRA will complete SPARK-17591 for now because I could proceed further 
> more tests on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18615) Switch to multi-line doc to avoid a genjavadoc bug for backticks

2016-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18615:
--
Assignee: Hyukjin Kwon

> Switch to multi-line doc to avoid a genjavadoc bug for backticks
> 
>
> Key: SPARK-18615
> URL: https://issues.apache.org/jira/browse/SPARK-18615
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> I suspect this is related with SPARK-16153 and genjavadoc issue in 
> https://github.com/typesafehub/genjavadoc/issues/85 but I am not too sure.
> Currently, single line comment does not mark down backticks to 
> {{..}} but prints as they are. For example, the line below:
> {code}
> /** Return an RDD with the pairs from `this` whose keys are not in `other`. */
> {code}
> So, we could work around this as below:
> {code}
> /**
>  * Return an RDD with the pairs from `this` whose keys are not in `other`.
>  */
> {code}
> Please refer the image in the pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18758) StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query

2016-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18758:
--
Assignee: Tathagata Das

> StreamingQueryListener events from a StreamingQuery should be sent only to 
> the listeners in the same session as the query
> -
>
> Key: SPARK-18758
> URL: https://issues.apache.org/jira/browse/SPARK-18758
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 2.1.0
>
>
> Listeners added with `sparkSession.streams.addListener(l)` are added to a 
> SparkSession. So events only from queries in the same session as a listener 
> should be posted to the listener.
> Currently, all the events gets routed through the Spark's main listener bus, 
> and therefore all StreamingQueryListener events gets posted to 
> StreamingQueryListeners in all sessions. This is wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18798) Expose the kill Executor in Yarn Mode

2016-12-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733860#comment-15733860
 ] 

Marcelo Vanzin commented on SPARK-18798:


Can you explain what you mean here?

The API to kill executors exists in SparkContext and definitely works on YARN 
(it was actually the first place where it was available).
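
For reference, a minimal sketch of the existing SparkContext calls referred to 
above; the executor IDs are placeholders:
{code}
// Both methods exist on SparkContext and ask the cluster manager to kill executors.
sc.killExecutor("1")             // kill a single executor by its ID
sc.killExecutors(Seq("2", "3"))  // kill several executors at once
{code}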

> Expose the kill Executor in Yarn Mode
> -
>
> Key: SPARK-18798
> URL: https://issues.apache.org/jira/browse/SPARK-18798
> Project: Spark
>  Issue Type: Improvement
>Reporter: Narendra
>
> Expose the kill Executor in Yarn Mode
> I can see Spark has already exposed the kill-executor method through 
> SparkContext for Mesos. If Spark can expose the same method for YARN, it would 
> be a good feature for anyone who wants to test application stability by 
> killing executors randomly.
> I see Spark has a kill-executor method in YarnAllocator, so it shouldn't be too 
> time-consuming to expose this; anyone can work on it, and I can as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17859) persist should not impede with spark's ability to perform a broadcast join.

2016-12-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17859.
---
   Resolution: Cannot Reproduce
Fix Version/s: 2.0.2

> persist should not impede with spark's ability to perform a broadcast join.
> ---
>
> Key: SPARK-17859
> URL: https://issues.apache.org/jira/browse/SPARK-17859
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.0
> Environment: spark 2.0.0 , Linux RedHat
>Reporter: Franck Tago
> Fix For: 2.0.2
>
>
> I am using Spark 2.0.0 
> My investigation leads me to conclude that calling persist could prevent 
> broadcast join  from happening .
> Example
> Case1: No persist call 
> var  df1 =spark.range(100).select($"id".as("id1"))
> df1: org.apache.spark.sql.DataFrame = [id1: bigint]
>  var df2 =spark.range(1000).select($"id".as("id2"))
> df2: org.apache.spark.sql.DataFrame = [id2: bigint]
>  df1.join(df2 , $"id1" === $"id2" ).explain 
> == Physical Plan ==
> *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight
> :- *Project [id#114L AS id1#117L]
> :  +- *Range (0, 100, splits=2)
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]))
>+- *Project [id#120L AS id2#123L]
>   +- *Range (0, 1000, splits=2)
> Case 2:  persist call 
>  df1.persist.join(df2 , $"id1" === $"id2" ).explain 
> 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
> == Physical Plan ==
> *SortMergeJoin [id1#3L], [id2#9L], Inner
> :- *Sort [id1#3L ASC], false, 0
> :  +- Exchange hashpartitioning(id1#3L, 10)
> : +- InMemoryTableScan [id1#3L]
> ::  +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> :: :  +- *Project [id#0L AS id1#3L]
> :: : +- *Range (0, 100, splits=2)
> +- *Sort [id2#9L ASC], false, 0
>+- Exchange hashpartitioning(id2#9L, 10)
>   +- InMemoryTableScan [id2#9L]
>  :  +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>  : :  +- *Project [id#6L AS id2#9L]
>  : : +- *Range (0, 1000, splits=2)
> Why does the persist call prevent the broadcast join?
> My opinion is that it should not.
> I was made aware that the persist call is lazy and that might have something 
> to do with it, but I still contend that it should not.
> Losing broadcast joins is really costly.
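
Not from the report, but as a hedged workaround sketch: an explicit broadcast 
hint tells the planner which side to broadcast instead of relying on size 
estimates; whether this restores the broadcast plan after persist in this exact 
scenario should be verified on the version in question.
{code}
import org.apache.spark.sql.functions.broadcast

// Same join as Case 2 above, but with an explicit broadcast hint on df2.
df1.persist.join(broadcast(df2), $"id1" === $"id2").explain
{code}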



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2016-12-08 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733855#comment-15733855
 ] 

Takeshi Yamamuro commented on SPARK-18591:
--

Yeah, if we can, that's the best. But IIUC it's difficult to do that in the first 
place, because Spark replaces logical nodes with physical ones in a top-down way, 
and we can't check the output ordering of children during planning.

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently uses sort-based aggregates only under a limited condition: the 
> cases where Spark cannot use partial aggregates and hash-based ones.
> However, if the input ordering already satisfies the requirements of 
> sort-based aggregates, it seems sort-based ones are faster than hash-based ones.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
> value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregarte
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this seems 
> to make the performance gap bigger;
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code} 
> Therefore, it'd be better to use sort-based ones in this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18770) Current Spark Master branch missing yarn module in pom

2016-12-08 Thread Narendra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733834#comment-15733834
 ] 

Narendra commented on SPARK-18770:
--

Even so, I have closed this; it is available in the main pom.

> Current Spark Master branch missing yarn module in pom
> --
>
> Key: SPARK-18770
> URL: https://issues.apache.org/jira/browse/SPARK-18770
> Project: Spark
>  Issue Type: Bug
>Reporter: Narendra
>Priority: Minor
>
> The current Spark master branch is missing the yarn module in the pom; because 
> of this, someone trying to build is not able to build locally.
> I have added that module in the pom because yarn 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733832#comment-15733832
 ] 

Sean Owen commented on SPARK-9487:
--

(Which thread?) I think that if you can get all tests in one language updated 
uniformly to pass with a different number of threads, that would be a 
sufficient unit of work to commit. I know it's not small. Or, maybe even a 
couple modules along with associated test improvements that make them robust to 
the number of threads. If you can get a significant logical chunk of 
improvement working we can commit it as a step towards a resolution.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
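
For illustration, a minimal sketch of what aligning on one worker-thread count 
could look like on the Scala side (local[4] here mirrors the Python tests; the 
app name is a placeholder):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Use the same number of worker threads as the Python tests (local[4]).
val conf = new SparkConf().setMaster("local[4]").setAppName("unit-test")
val sc = new SparkContext(conf)
{code}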



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18798) Expose the kill Executor in Yarn Mode

2016-12-08 Thread Narendra (JIRA)
Narendra created SPARK-18798:


 Summary: Expose the kill Executor in Yarn Mode
 Key: SPARK-18798
 URL: https://issues.apache.org/jira/browse/SPARK-18798
 Project: Spark
  Issue Type: Improvement
Reporter: Narendra


Expose the kill Executor in Yarn Mode

I can see Spark has already exposed the kill-executor method through SparkContext 
for Mesos. If Spark can expose the same method for YARN, it would be a good 
feature for anyone who wants to test application stability by killing executors 
randomly.
I see Spark has a kill-executor method in YarnAllocator, so it shouldn't be too 
time-consuming to expose this; anyone can work on it, and I can as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates

2016-12-08 Thread Miao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miao Wang updated SPARK-18332:
--
Comment: was deleted

(was: Update spark.logit is part of the QA work.)

> SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates

2016-12-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733785#comment-15733785
 ] 

Miao Wang commented on SPARK-18332:
---

[~josephkb]

https://github.com/apache/spark/pull/16222

This PR updates spark.logit.

In this PR, I will also clean up what you mentioned here.

Thanks!

> SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18795) SparkR vignette update: ksTest

2016-12-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733777#comment-15733777
 ] 

Miao Wang commented on SPARK-18795:
---

I will work on this one too.

Thanks!

Miao

> SparkR vignette update: ksTest
> --
>
> Key: SPARK-18795
> URL: https://issues.apache.org/jira/browse/SPARK-18795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> Update vignettes to cover ksTest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18792) SparkR vignette update: logit

2016-12-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733775#comment-15733775
 ] 

Miao Wang commented on SPARK-18792:
---

I have submitted a PR for SPARK-18797, which is the same as this one.

> SparkR vignette update: logit
> -
>
> Key: SPARK-18792
> URL: https://issues.apache.org/jira/browse/SPARK-18792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover logit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates

2016-12-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733771#comment-15733771
 ] 

Miao Wang commented on SPARK-18332:
---

Update spark.logit is part of the QA work.

> SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18697) Upgrade sbt plugins

2016-12-08 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18697:
-
Description: 
For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
plugins will be upgraded:
{code}
sbteclipse-plugin: 4.0.0 -> 5.0.1
sbt-mima-plugin: 0.1.11 -> 0.1.12
org.ow2.asm/asm: 5.0.3 -> 5.1 
org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
{code}

  was:
For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
plugins will be upgraded:
{code}
sbteclipse-plugin: 4.0.0 -> 5.0.1
sbt-mima-plugin: 0.1.11 -> 0.1.12
org.ow2.asm/asm: 5.0.3 -> 5.1 
org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
{code}

All other plugins are up-to-date. 


> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18697) Upgrade sbt plugins

2016-12-08 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18697:
-
Description: 
For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
plugins will be upgraded:
{code}
sbteclipse-plugin: 4.0.0 -> 5.0.1
sbt-mima-plugin: 0.1.11 -> 0.1.12
org.ow2.asm/asm: 5.0.3 -> 5.1 
org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
{code}

All other plugins are up-to-date. 

  was:
For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
plugins will be upgraded:
{code}
sbt-assembly: 0.11.2 -> 0.14.3
sbteclipse-plugin: 4.0.0 -> 5.0.1
sbt-mima-plugin: 0.1.11 -> 0.1.12
org.ow2.asm/asm: 5.0.3 -> 5.1 
org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
{code}

All other plugins are up-to-date. 


> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 
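For context, sbt plugin versions like these are declared in project/plugins.sbt; a sketch of what the upgraded entries might look like is below (the organization/artifact coordinates are illustrative assumptions, not taken from this issue):

{code}
// project/plugins.sbt -- illustrative sketch, coordinates assumed
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.12")

// asm is a plain library dependency of the build, not an sbt plugin
libraryDependencies ++= Seq(
  "org.ow2.asm" % "asm" % "5.1",
  "org.ow2.asm" % "asm-commons" % "5.1"
)
{code}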



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18792) SparkR vignette update: logit

2016-12-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-18792:
-

Assignee: Xiangrui Meng

> SparkR vignette update: logit
> -
>
> Key: SPARK-18792
> URL: https://issues.apache.org/jira/browse/SPARK-18792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover logit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18697) Upgrade sbt plugins

2016-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733760#comment-15733760
 ] 

Apache Spark commented on SPARK-18697:
--

User 'weiqingy' has created a pull request for this issue:
https://github.com/apache/spark/pull/16223

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18797) Update spark.logit in sparkr-vignettes

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18797:


Assignee: Apache Spark

> Update spark.logit in sparkr-vignettes
> --
>
> Key: SPARK-18797
> URL: https://issues.apache.org/jira/browse/SPARK-18797
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Apache Spark
>
> spark.logit was added in 2.1. We need to update the sparkr-vignettes to reflect 
> the change. This is part of the SparkR QA work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18797) Update spark.logit in sparkr-vignettes

2016-12-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18797:


Assignee: (was: Apache Spark)

> Update spark.logit in sparkr-vignettes
> --
>
> Key: SPARK-18797
> URL: https://issues.apache.org/jira/browse/SPARK-18797
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Miao Wang
>
> spark.logit was added in 2.1. We need to update the sparkr-vignettes to reflect 
> the change. This is part of the SparkR QA work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18797) Update spark.logit in sparkr-vignettes

2016-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733746#comment-15733746
 ] 

Apache Spark commented on SPARK-18797:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/16222

> Update spark.logit in sparkr-vignettes
> --
>
> Key: SPARK-18797
> URL: https://issues.apache.org/jira/browse/SPARK-18797
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Miao Wang
>
> spark.logit was added in 2.1. We need to update the sparkr-vignettes to reflect 
> the change. This is part of the SparkR QA work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18797) Update spark.logit in sparkr-vignettes

2016-12-08 Thread Miao Wang (JIRA)
Miao Wang created SPARK-18797:
-

 Summary: Update spark.logit in sparkr-vignettes
 Key: SPARK-18797
 URL: https://issues.apache.org/jira/browse/SPARK-18797
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Miao Wang


spark.logit was added in 2.1. We need to update the sparkr-vignettes to reflect 
the change. This is part of the SparkR QA work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16448) RemoveAliasOnlyProject should not remove alias with metadata

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16448:
--
Component/s: SQL

> RemoveAliasOnlyProject should not remove alias with metadata
> 
>
> Key: SPARK-16448
> URL: https://issues.apache.org/jira/browse/SPARK-16448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-12-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733687#comment-15733687
 ] 

Apache Spark commented on SPARK-18590:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/16221

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1
>
>
> We should include the built SparkR source package in the Spark distribution. 
> This will enable help and vignettes when the package is used. This source 
> package is also what we would release to CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17239) User guide for multiclass logistic regression in spark.ml

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17239:
--
Component/s: Documentation

> User guide for multiclass logistic regression in spark.ml
> -
>
> Key: SPARK-17239
> URL: https://issues.apache.org/jira/browse/SPARK-17239
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17213:
--
Component/s: SQL

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. However, Parquet does not currently 
> respect this ordering. This is being fixed in Parquet (JIRA and PR linked below), 
> but for now all filters over strings are broken, and there is an actual 
> correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, this does not affect correctness, but 
> it forces you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
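To see why the two orderings disagree, here is a small standalone sketch (plain Scala, not Spark or Parquet code) comparing the first UTF-8 byte of "a" and "é" as signed versus unsigned values:

{code}
import java.nio.charset.StandardCharsets

val a = "a".getBytes(StandardCharsets.UTF_8)   // [0x61]       -> 97 signed
val e = "é".getBytes(StandardCharsets.UTF_8)   // [0xC3, 0xA9] -> first byte -61 signed

// Signed byte comparison (what a naive byte[] comparator does):
// -61 < 97, so "é" sorts before "a" and the row group statistics become min=é, max=a.
val signedCmp = Integer.compare(a(0), e(0))                  // > 0, i.e. "a" > "é"

// Unsigned comparison (what Spark's UTF8String ordering effectively uses):
// 0xC3 = 195 > 97, so "a" sorts before "é".
val unsignedCmp = Integer.compare(a(0) & 0xFF, e(0) & 0xFF)  // < 0, i.e. "a" < "é"

println(s"signed: $signedCmp, unsigned: $unsignedCmp")
{code}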



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17113) Job failure due to Executor OOM in offheap mode

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17113:
--
Component/s: Spark Core

> Job failure due to Executor OOM in offheap mode
> ---
>
> Key: SPARK-17113
> URL: https://issues.apache.org/jira/browse/SPARK-17113
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
> Fix For: 2.0.1, 2.1.0
>
>
> We have been seeing many job failures due to executor OOM with the following 
> stack trace:
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Digging into the code, we found out that this is an issue with cooperative 
> memory management for off-heap memory allocation.
> In the code at 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L463,
> when the UnsafeExternalSorter checks whether a memory page is still being used by 
> the upstream reader, the base object is always null for off-heap memory, so the 
> UnsafeExternalSorter does not spill the memory pages.
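As a standalone illustration of that failure mode (simplified names, not the actual Spark internals), an identity check on page base objects cannot distinguish off-heap pages, because every off-heap page reports a null base object:

{code}
// Simplified sketch only: with off-heap allocation every page's base object is
// null, so an identity check on base objects makes every page look like the page
// the upstream reader is still using, and none of them get released on spill.
final case class Page(baseObject: AnyRef, baseOffset: Long)

def stillInUseByUpstream(page: Page, upstreamPage: Page): Boolean =
  page.baseObject eq upstreamPage.baseObject

val upstreamOnHeap = Page(new Array[Byte](16), 0L)
val otherOnHeap    = Page(new Array[Byte](16), 0L)
stillInUseByUpstream(otherOnHeap, upstreamOnHeap)    // false -- page can be freed

val upstreamOffHeap = Page(null, 0x10000L)           // off-heap: base object is null
val otherOffHeap    = Page(null, 0x20000L)
stillInUseByUpstream(otherOffHeap, upstreamOffHeap)  // true -- wrongly kept, never freed
{code}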



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17162) Range does not support SQL generation

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17162:
--
Component/s: SQL

> Range does not support SQL generation
> -
>
> Key: SPARK-17162
> URL: https://issues.apache.org/jira/browse/SPARK-17162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> {code}
> scala> sql("create view a as select * from range(100)")
> 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as 
> select * from range(100)
> java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, 
> splits=8)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16928:
--
Component/s: SQL

> Recursive call of ColumnVector::getInt() breaks JIT inlining
> 
>
> Key: SPARK-16928
> URL: https://issues.apache.org/jira/browse/SPARK-16928
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>Assignee: Qifan Pu
>Priority: Minor
> Fix For: 2.1.0
>
>
> In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
> with the following code pattern: 
> {code}
>   public int getInt(int rowId) {
> if (dictionary == null) {
>   return intData[rowId];
> } else {
>   return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
> }
>   }
> {code}
> As dictionaryIds is also a ColumnVector, this results in a recursive call of 
> getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16898) Adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16898:
--
Component/s: SQL

> Adds argument type information for typed logical plan like MapElements, 
> TypedFilter, and AppendColumn
> -
>
> Key: SPARK-16898
> URL: https://issues.apache.org/jira/browse/SPARK-16898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.1.0
>
>
> Typed logical plans like MapElements, TypedFilter, and AppendColumn contain a 
> closure field: {{func: (T) => Boolean}}. For example, class TypedFilter's 
> signature is:
> {code}
> case class TypedFilter(
> func: AnyRef,
> deserializer: Expression,
> child: LogicalPlan) extends UnaryNode
> {code} 
> From the above class signature, we cannot easily find:
> 1. What is the input argument's type of the closure {{func}}? How do we know 
> which apply method to pick if there are multiple overloaded apply methods?
> 2. What is the input argument's schema? 
> With this info, it is easier for us to define custom optimizer rules to 
> translate these typed logical plans into more efficient implementations, like the 
> closure optimization idea in SPARK-14083.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16906) Adds more input type information for TypedAggregateExpression

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16906:
--
Component/s: SQL

> Adds more input type information for TypedAggregateExpression
> -
>
> Key: SPARK-16906
> URL: https://issues.apache.org/jira/browse/SPARK-16906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.1.0
>
>
> For TypedAggregateExpression 
> {code}
> case class TypedAggregateExpression(
> aggregator: Aggregator[Any, Any, Any],
> inputDeserializer: Option[Expression],
> bufferSerializer: Seq[NamedExpression],
> bufferDeserializer: Expression,
> outputSerializer: Seq[Expression],
> outputExternalType: DataType,
> dataType: DataType,
> nullable: Boolean) extends DeclarativeAggregate with NonSQLExpression
> {code}
> Aggregator of TypedAggregateExpression usually contains a closure like: 
> {code}
> class TypedSumDouble[IN](f: IN => Double) extends Aggregator[IN, Double, 
> Double]
> {code}
> It would be great if we could add more info to TypedAggregateExpression to 
> describe the closure input type {{IN}}, such as its class and schema, so that we 
> can use this info for customized optimizations like SPARK-14083.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16870) add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people to how to fix this timeout error when it happenned

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16870:
--
Component/s: Documentation

> add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help 
> people to how to fix this timeout error when it happenned
> -
>
> Key: SPARK-16870
> URL: https://issues.apache.org/jira/browse/SPARK-16870
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Liang Ke
>Assignee: Liang Ke
>Priority: Trivial
> Fix For: 2.0.1, 2.1.0
>
>
> Here is my workload and what I found: 
> I run a large number of jobs with spark-sql at the same time and hit an error 
> reporting a timeout (some of the jobs contain a broadcast-join operator):
> 16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING,
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecute
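Since the point of this issue is documenting spark.sql.broadcastTimeout, a short illustration of raising it follows (the value is arbitrary; the default of 300 seconds matches the "Futures timed out after [300 seconds]" message above):

{code}
// Raise the broadcast-join timeout to 20 minutes; the value is in seconds.
spark.sql("SET spark.sql.broadcastTimeout=1200")     // Spark 2.x, SQL form
spark.conf.set("spark.sql.broadcastTimeout", 1200)   // Spark 2.x, Scala API
// On the older SQLContext API: sqlContext.setConf("spark.sql.broadcastTimeout", "1200")
{code}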
> 

[jira] [Updated] (SPARK-16853) Analysis error for DataSet typed selection

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16853:
--
Component/s: SQL

> Analysis error for DataSet typed selection
> --
>
> Key: SPARK-16853
> URL: https://issues.apache.org/jira/browse/SPARK-16853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>
> For DataSet typed selection
> {code}
> def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
> {code}
> If U1 contains sub-fields, then it reports AnalysisException 
> Reproducer:
> {code}
> scala> case class A(a: Int, b: Int)
> scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
> org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
> columns: [_2];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16818:
--
Component/s: SQL

> Exchange reuse incorrectly reuses scans over different sets of partitions
> -
>
> Key: SPARK-16818
> URL: https://issues.apache.org/jira/browse/SPARK-16818
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> This happens because the file scan operator does not take into account 
> partition pruning in its implementation of `sameResult()`. As a result, 
> executions may be incorrect on self-joins over the same base file relation. 
> Here's a minimal test case to reproduce:
> {code}
> spark.conf.set("spark.sql.exchange.reuse", true)  // defaults to true in 
> 2.0
> withTempPath { path =>
>   val tempDir = path.getCanonicalPath
>   spark.range(10)
> .selectExpr("id % 2 as a", "id % 3 as b", "id as c")
> .write
> .partitionBy("a")
> .parquet(tempDir)
>   val df = spark.read.parquet(tempDir)
>   val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum")
>   val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum")
>   checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, 
> 10, 5) :: Nil)
> {code}
> When exchange reuse is on, the result is
> {code}
> +---+--+--+
> |  b|sum(c)|sum(c)|
> +---+--+--+
> |  0| 6| 6|
> |  1| 4| 4|
> |  2|10|10|
> +---+--+--+
> {code}
> The correct result is
> {code}
> +---+--+--+
> |  b|sum(c)|sum(c)|
> +---+--+--+
> |  0| 6|12|
> |  1| 4| 8|
> |  2|10| 5|
> +---+--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18722) Move no data rate limit from StreamExecution to ProgressReporter

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18722:
--
Component/s: Structured Streaming

> Move no data rate limit from StreamExecution to ProgressReporter
> 
>
> Key: SPARK-18722
> URL: https://issues.apache.org/jira/browse/SPARK-18722
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.0
>
>
> So that we can also limit items in `recentProgresses`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18370) InsertIntoHadoopFsRelationCommand should keep track of its table

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18370:
--
Component/s: SQL

> InsertIntoHadoopFsRelationCommand should keep track of its table
> 
>
> Key: SPARK-18370
> URL: https://issues.apache.org/jira/browse/SPARK-18370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.1.0
>
>
> When we plan an {{InsertIntoHadoopFsRelationCommand}} we drop the {{Table}} 
> name. This is quite annoying when debugging plans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18280:
--
Component/s: Spark Core

> Potential deadlock in `StandaloneSchedulerBackend.dead`
> ---
>
> Key: SPARK-18280
> URL: https://issues.apache.org/jira/browse/SPARK-18280
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.0
>
>
> "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not 
> call "SparkContext.stop" in the same thread. "SparkContext.stop" will block 
> until all RPC threads exit, if it's called inside a RPC thread, it will be 
> dead-lock.
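A minimal sketch of the usual way around this kind of self-deadlock (illustrative only, not the actual patch): hand the stop off to a separate thread instead of blocking the RPC thread itself.

{code}
// Sketch only: stop the SparkContext from a dedicated thread so the RPC thread
// that delivered the "dead" notification is not the one blocking on RPC shutdown.
def dead(reason: String, sc: org.apache.spark.SparkContext): Unit = {
  val stopper = new Thread("stop-spark-context") {
    override def run(): Unit = sc.stop()
  }
  stopper.setDaemon(true)
  stopper.start()
}
{code}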



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17994) Add back a file status cache for catalog tables

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17994:
--
Component/s: SQL

> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
> be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
> cache, it can be incrementally updated as new partitions are read.
> cc [~michael]
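As a rough sketch of the proposed shape (the class name and invalidation hooks below are assumptions for illustration, not the final design):

{code}
// Illustrative sketch of a per-table file status cache keyed by partition path.
import org.apache.hadoop.fs.{FileStatus, Path}
import scala.collection.concurrent.TrieMap

class TableFileStatusCache {
  private val listed = TrieMap.empty[Path, Array[FileStatus]]

  def get(path: Path): Option[Array[FileStatus]] = listed.get(path)

  // Incrementally add newly listed partitions as they are read.
  def put(path: Path, files: Array[FileStatus]): Unit = listed.update(path, files)

  // Dropped wholesale on refreshTable() / refreshByPath().
  def invalidateAll(): Unit = listed.clear()
}
{code}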



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18125:
--
Component/s: SQL

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 2.0.3, 2.1.0
>
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> The 2.0.1 error message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ 
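For completeness, a self-contained sketch of the reported pattern (the case class and data below are invented for illustration; the report does not include the original dataset):

{code}
// Hypothetical repro sketch of the .groupByKey.reduceGroups.map(_._2) pattern.
import org.apache.spark.sql.SparkSession

case class Event(key: String, count: Long)

val spark = SparkSession.builder().appName("repro-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

val reduced = ds
  .groupByKey(_.key)
  .reduceGroups((x, y) => Event(x.key, x.count + y.count))
  .map(_._2)   // keep only the reduced values, as in the report

reduced.show()
{code}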

[jira] [Updated] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18103:
--
Component/s: SQL

> Rename *FileCatalog to *FileProvider
> 
>
> Key: SPARK-18103
> URL: https://issues.apache.org/jira/browse/SPARK-18103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> In the SQL component there are too many different components called some 
> variant of *Catalog, which is quite confusing. We should rename the 
> subclasses of FileCatalog to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17446) no total size for data source tables in InMemoryCatalog

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17446:
--
Component/s: SQL

> no total size for data source tables in InMemoryCatalog
> ---
>
> Key: SPARK-17446
> URL: https://issues.apache.org/jira/browse/SPARK-17446
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>
> For a data source table in InMemoryCatalog, catalogTable.storage.locationUri 
> is None, so the total size can't be calculated. 
> But we can use the path parameter in catalogTable.storage.properties to 
> calculate the size.
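A minimal sketch of computing a table's total size from a filesystem path (the helper name and the use of getContentSummary are illustrative, not the actual patch):

{code}
// Sketch only: total on-disk size (in bytes) of everything under a path,
// e.g. the "path" entry from catalogTable.storage.properties.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def totalSizeInBytes(location: String, hadoopConf: Configuration = new Configuration()): Long = {
  val path = new Path(location)
  val fs = path.getFileSystem(hadoopConf)
  fs.getContentSummary(path).getLength
}
{code}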



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17394) should not allow specify database in table/view name after RENAME TO

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17394:
--
Component/s: SQL

> should not allow specify database in table/view name after RENAME TO
> 
>
> Key: SPARK-17394
> URL: https://issues.apache.org/jira/browse/SPARK-17394
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17296:
--
Component/s: SQL

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Assignee: Herman van Hovell
> Fix For: 2.0.1, 2.1.0
>
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS JOIN or one of the 
> JOINs makes the issue disappear).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17114) Adding a 'GROUP BY 1' where first column is literal results in wrong answer

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17114:
--
Component/s: SQL

> Adding a 'GROUP BY 1' where first column is literal results in wrong answer
> ---
>
> Key: SPARK-17114
> URL: https://issues.apache.org/jira/browse/SPARK-17114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Herman van Hovell
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> Consider the following example:
> {code}
> sc.parallelize(Seq(128, 256)).toDF("int_col").registerTempTable("mytable")
> // The following query should return an empty result set because the filter 
> condition (int_col == 0) is always false for this table.
> val withoutGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
> """)
> assert(withoutGroupBy.collect().isEmpty, "original query returned wrong 
> answer")
> // After adding a 'GROUP BY 1' the query result should still be empty because 
> we'd be grouping an empty table:
> val withGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
>   GROUP BY 1
> """)
> assert(withGroupBy.collect().isEmpty, "adding GROUP BY resulted in wrong 
> answer")
> {code}
> Here, this fails the second assertion by returning a single row. It appears 
> that running {{group by 1}} where column 1 is a constant causes filter 
> conditions to be ignored.
> Both PostgreSQL and SQLite return empty result sets for the query containing 
> the {{GROUP BY}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17034:
--
Component/s: SQL

> Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
> -
>
> Key: SPARK-17034
> URL: https://issues.apache.org/jira/browse/SPARK-17034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>
> Ordinals in GROUP BY or ORDER BY like "1" in "order by 1" or "group by 1" 
> should be considered unresolved before analysis. But in the current code, a 
> "Literal" expression is used to store the ordinal. This is inappropriate because 
> "Literal" itself is a resolved expression, which gives the user the wrong 
> impression that the ordinal has already been resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16955:
--
Component/s: SQL

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Updated] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16888:
--
Component/s: SQL

> Implements eval method for expression AssertNotNull
> ---
>
> Key: SPARK-16888
> URL: https://issues.apache.org/jira/browse/SPARK-16888
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.1.0
>
>
> We should support the eval() method for the expression AssertNotNull.
> Currently, it throws UnsupportedOperationException when used in a projection 
> over a LocalRelation.
> {code}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
> scala> import org.apache.spark.sql.Column
> scala> case class A(a: Int)
> scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
> Nil))).explain
> java.lang.UnsupportedOperationException: Only code-generated evaluation is 
> supported.
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
>   ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16829) sparkR sc.setLogLevel doesn't work

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16829:
--
Component/s: SparkR

> sparkR sc.setLogLevel doesn't work
> --
>
> Key: SPARK-16829
> URL: https://issues.apache.org/jira/browse/SPARK-16829
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.0
>
>
> {code}
> ./bin/sparkR
> Launching java with spark-submit command /Users/mwang/spark_ws_0904/bin/spark-submit   "sparkr-shell" /var/folders/s_/83b0sgvj2kl2kwq4stvft_pmgn/T//RtmpQxJGiZ/backend_porte9474603ed1e
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> > sc.setLogLevel("INFO")
> Error: could not find function "sc.setLogLevel"
> {code}
> The startup message tells users to call sc.setLogLevel, but no such function exists in SparkR.
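
For context: the banner wording comes from the Scala/JVM shell, where setLogLevel is a method on the SparkContext, while SparkR exposes the equivalent as a standalone function rather than a method on sc, which is why the suggested call fails. A minimal Scala sketch of the JVM-side call the message was written for (the app name and master below are arbitrary):

{code}
// Scala illustration only; this is the JVM API, not the SparkR fix.
import org.apache.spark.{SparkConf, SparkContext}

object SetLogLevelExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log-level-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("INFO")  // valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF
    sc.parallelize(1 to 10).count()
    sc.stop()
  }
}
{code}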



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16884:
--
Component/s: SQL

> Move DataSourceScanExec out of ExistingRDD.scala file
> -
>
> Key: SPARK-16884
> URL: https://issues.apache.org/jira/browse/SPARK-16884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Trivial
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16138) YarnAllocator tries to cancel executor requests when we have none

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16138:
--
Component/s: YARN

> YarnAllocator tries to cancel executor requests when we have none
> -
>
> Key: SPARK-16138
> URL: https://issues.apache.org/jira/browse/SPARK-16138
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Peter Ableda
>Assignee: Peter Ableda
>Priority: Minor
> Fix For: 2.1.0
>
>
> Easy to reproduce: 
> $ spark-shell
> > sc.parallelize(1 to 1).count()
> After the computation finishes there are 0 pending tasks, so 0 executors are needed.
> {code}
> 16/06/21 04:07:41 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> 16/06/21 04:07:41 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
> 16/06/21 04:07:41 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
> ...
> 16/06/21 04:08:41 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
> 16/06/21 04:08:41 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
> 16/06/21 04:08:41 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1.
> 16/06/21 04:08:48 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
> 16/06/21 04:08:48 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
> 16/06/21 04:08:48 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
> 16/06/21 04:08:48 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 2.
> {code}
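
The noise comes from taking the "cancel" branch even when the number of requests to cancel is zero. Below is a standalone Scala sketch of the kind of guard that avoids it; the method and parameter names are illustrative and do not match the actual YarnAllocator fields.

{code}
// Illustrative only: the real YarnAllocator tracks these counts internally.
object YarnAllocatorGuardSketch {
  def syncExecutorRequests(targetNum: Int, runningNum: Int, pendingNum: Int): Unit = {
    val missing = targetNum - runningNum - pendingNum
    if (missing > 0) {
      println(s"Will request $missing executor container(s)")
    } else if (missing < 0) {
      // Only log and cancel when there is actually something to cancel; the
      // reported behavior logged this branch even when nothing was pending.
      val toCancel = math.min(pendingNum, -missing)
      if (toCancel > 0) {
        println(s"Canceling requests for $toCancel executor container(s)")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Target 0 executors, 1 still running, 0 pending requests:
    // prints nothing instead of "Canceling requests for 0 executor containers".
    syncExecutorRequests(targetNum = 0, runningNum = 1, pendingNum = 0)
  }
}
{code}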



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15958) Make initial buffer size for the Sorter configurable

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15958:
--
Component/s: Spark Core

> Make initial buffer size for the Sorter configurable
> 
>
> Key: SPARK-15958
> URL: https://issues.apache.org/jira/browse/SPARK-15958
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the initial buffer size in the sorter is hard-coded 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88)
> and is too small for large workloads. As a result, the sorter spends 
> significant time expanding the buffer and copying the data. It would be 
> useful to make this size configurable.
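
A hedged sketch of what "configurable" could look like if the size were read from SparkConf; the property name and default below are hypothetical, not the key any eventual patch used.

{code}
// Hypothetical conf key, for illustration only.
import org.apache.spark.SparkConf

object SorterBufferSizeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Fall back to a small default when the (hypothetical) key is unset.
    val initialBufferSize = conf.getInt("spark.sql.sorter.initialBufferSize", 4096)
    println(s"UnsafeExternalRowSorter would start with room for $initialBufferSize rows")
  }
}
{code}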



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15783) Fix more flakiness: o.a.s.scheduler.BlacklistIntegrationSuite

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15783:
--
Component/s: Spark Core

> Fix more flakiness: o.a.s.scheduler.BlacklistIntegrationSuite
> -
>
> Key: SPARK-15783
> URL: https://issues.apache.org/jira/browse/SPARK-15783
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.1.0
>
>
> Looks like SPARK-15714 didn't address all the sources of flakiness. First, 
> the test will be turned off to stop breaking builds; then we will try to fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15621:
--
Component/s: SQL

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and eventually fails with OOM, even 
> with an identity UDF (`lambda x: x`).
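
For intuition about the failure mode: rows handed to the Python worker are buffered until the corresponding results come back, and with no bound or spilling that buffer grows with the input. The following standalone Scala sketch (not Spark's actual fix) shows how a bounded queue gives the producer backpressure so memory stays bounded.

{code}
// Standalone illustration of backpressure via a bounded queue.
import java.util.concurrent.ArrayBlockingQueue

object BoundedBufferSketch {
  def main(args: Array[String]): Unit = {
    // Bounded to 1024 elements: put() blocks when the queue is full, so the
    // producer can never get far ahead of the consumer.
    val queue = new ArrayBlockingQueue[Int](1024)
    val n = 1000000

    val producer = new Thread(new Runnable {
      override def run(): Unit = (1 to n).foreach(queue.put(_))
    })
    val consumer = new Thread(new Runnable {
      override def run(): Unit = (1 to n).foreach(_ => queue.take())
    })

    producer.start(); consumer.start()
    producer.join(); consumer.join()
    println(s"moved $n rows while holding at most 1024 in memory")
  }
}
{code}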



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


