[jira] [Updated] (SPARK-22739) Additional Expression Support for Objects
[ https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-22739: --- Priority: Major (was: Critical) > Additional Expression Support for Objects > - > > Key: SPARK-22739 > URL: https://issues.apache.org/jira/browse/SPARK-22739 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Aleksander Eskilson >Priority: Major > > Some discussion in Spark-Avro [1] motivates additions and minor changes to > the {{Objects}} Expressions API [2]. The proposed changes include > * a generalized form of {{initializeJavaBean}} taking a sequence of > initialization expressions that can be applied to instances of varying objects > * an object cast that performs a simple Java type cast against a value > * making {{ExternalMapToCatalyst}} public, for use in outside libraries > These changes would facilitate the writing of custom encoders for varying > objects that cannot already be readily converted to a statically typed > dataset by a JavaBean encoder (e.g. Avro). > [1] -- > https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110 > [2] -- > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
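The "generalized form of initializeJavaBean" idea can be illustrated with a plain-Scala toy sketch (the names `Record` and `initializeObject` are illustrative only — this is not the Catalyst `Expression` API the ticket targets): create one instance, then fold a sequence of initialization steps over it, which is the shape a custom encoder for e.g. Avro records would need.

```scala
// Toy illustration (plain Scala, NOT the actual Catalyst objects API) of a
// generalized initializer: one constructor thunk plus an arbitrary sequence
// of per-field initialization steps applied to the new instance.
class Record {
  var name: String = _
  var age: Int = _
}

def initializeObject[T](newInstance: () => T, setters: Seq[T => Unit]): T = {
  val obj = newInstance()
  setters.foreach(setter => setter(obj)) // each step initializes one field
  obj
}

val record = initializeObject(() => new Record, Seq[Record => Unit](
  r => r.name = "Joe",
  r => r.age = 20
))
```

Because the setters are just a sequence, the same driver works for objects with varying shapes, which is the generalization the ticket asks for.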
[jira] [Resolved] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal resolved SPARK-23020. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20223 [https://github.com/apache/spark/pull/20223] > Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > -- > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/
[jira] [Assigned] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal reassigned SPARK-23020: -- Assignee: Marcelo Vanzin > Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > -- > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/
[jira] [Commented] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326795#comment-16326795 ] Apache Spark commented on SPARK-23085: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/20275 > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
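A minimal sketch of what parity could look like (plain Scala stand-in, hypothetical — not the actual `Vectors` implementation): relax the size check from strictly positive to non-negative so the tuple-based overload accepts empty vectors, as the other three overloads already do.

```scala
// Hypothetical stand-in for MLLib.Vectors.sparse(size, elements): the fix
// sketched here is changing a `size > 0` requirement to `size >= 0`.
def sparse(size: Int, elements: Seq[(Int, Double)]): (Int, Array[Int], Array[Double]) = {
  require(size >= 0, s"The size of the requested sparse vector ($size) must be non-negative.")
  val sorted = elements.sortBy(_._1)          // indices must be ordered
  val indices = sorted.map(_._1).toArray
  val values  = sorted.map(_._2).toArray
  require(indices.forall(i => i >= 0 && i < size), "Index out of bounds.")
  (size, indices, values)
}

val empty    = sparse(0, Seq.empty)           // accepted under the relaxed check
val nonEmpty = sparse(3, Seq((2, 5.0), (0, 1.0)))
```

Under the original strict check, the `sparse(0, Seq.empty)` call would throw the `requirement failed` error shown in the REPL transcript above.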
[jira] [Assigned] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23085: Assignee: (was: Apache Spark) > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Assigned] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23085: Assignee: Apache Spark > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Updated] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-23085: - Description: Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. {code:java} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} was: Both {ML.Vectors#sparse size: Int, indices: Array[Int], values: Array[Double] } and {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse( size: Int, indices: Array[Int], values: Array[Double] )}} also supports it. However, {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} require a positve length. {code:java} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Updated] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-23085: - Description: Both {ML.Vectors#sparse size: Int, indices: Array[Int], values: Array[Double] } and {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse( size: Int, indices: Array[Int], values: Array[Double] )}} also supports it. However, {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} require a positve length. {code:java} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} was: Both {{ML.Vectors#sparse size: Int, indices: Array[Int], values: Array[Double] }} and {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse( size: Int, indices: Array[Int], values: Array[Double] )}} also supports it. However, {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} require a positve length. {code} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} require a positve length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Created] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
zhengruifeng created SPARK-23085: Summary: API parity for mllib.linalg.Vectors.sparse Key: SPARK-23085 URL: https://issues.apache.org/jira/browse/SPARK-23085 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.0 Reporter: zhengruifeng Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. {code} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code}
[jira] [Updated] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22956: - Fix Version/s: (was: 2.3.0) 2.4.0 2.3.1 > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > When we union 2 streams from Kafka or other sources, and one of them has no continuous data coming in while a task restarts, this causes an `IllegalStateException`. This is mainly caused by the code in [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190]: when one stream has no continuous data, its committedOffset equals its availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` is not handled properly in KafkaSource. We should perhaps also consider this scenario in other Sources.
[jira] [Resolved] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22956. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20150 [https://github.com/apache/spark/pull/20150] > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.3.0 > > > When we union 2 streams from Kafka or other sources, and one of them has no continuous data coming in while a task restarts, this causes an `IllegalStateException`. This is mainly caused by the code in [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190]: when one stream has no continuous data, its committedOffset equals its availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` is not handled properly in KafkaSource. We should perhaps also consider this scenario in other Sources.
[jira] [Assigned] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-22956: Assignee: Li Yuanjian > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > > When we union 2 streams from Kafka or other sources, and one of them has no continuous data coming in while a task restarts, this causes an `IllegalStateException`. This is mainly caused by the code in [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190]: when one stream has no continuous data, its committedOffset equals its availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` is not handled properly in KafkaSource. We should perhaps also consider this scenario in other Sources.
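The failure mode above can be modeled with a toy sketch (plain Scala; `StreamOffset` and `pendingRange` are illustrative names, not Spark's internal classes): after a restart, a source that produced no new data has a committed offset equal to its available offset, and code that assumes a non-empty offset range breaks.

```scala
// Toy model of the restart scenario (NOT Spark internals): a source with no
// new data has committed == available after start-offset recovery.
final case class StreamOffset(value: Long)

// Returning None for an empty range, instead of assuming data exists,
// is the kind of handling the description says KafkaSource was missing.
def pendingRange(committed: StreamOffset, available: StreamOffset): Option[(StreamOffset, StreamOffset)] =
  if (available.value > committed.value) Some((committed, available)) else None

val idleSource   = pendingRange(StreamOffset(42), StreamOffset(42)) // nothing new
val activeSource = pendingRange(StreamOffset(42), StreamOffset(45)) // new data
```

In a union of two sources, only the active one should contribute a batch; treating the idle source's empty range as an error is what surfaces as the `IllegalStateException`.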
[jira] [Updated] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-19700: --- Issue Type: Improvement (was: Bug) > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah >Priority: Major > > One point that was brought up in discussing SPARK-18278 was that schedulers cannot easily be added to Spark without forking the whole project. The main reason is that much of the scheduler's behavior fundamentally depends on the CoarseGrainedSchedulerBackend class, which is not part of the public API of Spark and is in fact quite a complex module. As resource management and allocation continues to evolve, Spark will need to be integrated with more cluster managers, but maintaining support for all possible allocators in the Spark project would be untenable. Furthermore, it would be impossible for Spark to support proprietary frameworks that are developed by specific users for their own particular use cases. > Therefore, this ticket proposes making scheduler implementations fully pluggable. The idea is that Spark will provide a Java/Scala interface that is to be implemented by a scheduler that is backed by the cluster manager of interest. The user can compile their scheduler's code into a JAR that is placed on the driver's classpath. Finally, as is the case in the current world, the scheduler implementation is selected and dynamically loaded depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current CoarseGrainedSchedulerBackend handles many responsibilities, some of which will be common across all cluster managers, and some of which will be specific to a particular cluster manager. For example, the particular mechanism for creating the executor processes will differ between YARN and Mesos, but, once these executors have started running, the means to submit tasks to them over the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the application, because different cluster managers support different configuration options, and thus the driver must be bootstrapped accordingly. For example, in YARN mode the application and Hadoop configuration must be packaged and shipped to the distributed cache prior to launching the job. A prototype of a Kubernetes implementation starts a Kubernetes pod that runs the driver in cluster mode.
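The shape of such a pluggable API might look like the following sketch (the trait and method names `PluggableScheduler`, `canHandle`, etc. are hypothetical, not an actual Spark interface): an implementation declares which master URLs it handles, and the driver selects one by URL, mirroring the dynamic loading the ticket describes.

```scala
// Hypothetical pluggable-scheduler interface (illustrative names only).
trait PluggableScheduler {
  def canHandle(masterUrl: String): Boolean // e.g. "k8s://...", "yarn"
  def start(): Unit                         // bootstrap cluster-specific state
  def stop(): Unit
}

// Selection by master URL, as the ticket proposes for dynamic loading.
def select(schedulers: Seq[PluggableScheduler], masterUrl: String): Option[PluggableScheduler] =
  schedulers.find(_.canHandle(masterUrl))

// A toy implementation standing in for a Kubernetes-backed scheduler.
class KubernetesLikeScheduler extends PluggableScheduler {
  def canHandle(masterUrl: String): Boolean = masterUrl.startsWith("k8s://")
  def start(): Unit = ()
  def stop(): Unit = ()
}

val chosen  = select(Seq(new KubernetesLikeScheduler), "k8s://https://host:6443")
val noMatch = select(Seq(new KubernetesLikeScheduler), "yarn")
```

In the proposed design the implementations would come from user JARs on the driver's classpath rather than a hard-coded `Seq`; the URL-based dispatch is the part sketched here.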
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326711#comment-16326711 ] Anirudh Ramanathan commented on SPARK-23083: opened https://github.com/apache/spark-website/pull/87 > Adding Kubernetes as an option to https://spark.apache.org/ > --- > > Key: SPARK-23083 > URL: https://issues.apache.org/jira/browse/SPARK-23083 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > [https://spark.apache.org/] can now include a reference to Kubernetes, along with the k8s logo. I think this is not tied to the docs. > cc/ [~rxin] [~sameer]
[jira] [Resolved] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23080. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20271 [https://github.com/apache/spark/pull/20271] > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.3.0 > > > When a user puts the wrong number of parameters in a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is instead a built-in one, no information about the expected and actual number of parameters is provided. Providing it can help debug errors in some cases (e.g. bad quote escaping may lead to a different number of parameters than expected).
[jira] [Assigned] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23080: Assignee: Marco Gaido > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.3.0 > > > When a user puts the wrong number of parameters in a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is instead a built-in one, no information about the expected and actual number of parameters is provided. Providing it can help debug errors in some cases (e.g. bad quote escaping may lead to a different number of parameters than expected).
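A sketch of the kind of message this change enables (hypothetical helper, not the actual Spark implementation): report both the argument counts a built-in function accepts and the count actually supplied, as was already done for UDFs.

```scala
// Hypothetical arity-mismatch message builder (illustrative only).
def invalidArityMessage(function: String, expected: Seq[Int], actual: Int): String =
  s"Invalid number of arguments for function $function. " +
    s"Expected: ${expected.mkString(" or ")}; Found: $actual"

// Example: a function accepting 2 or 3 arguments called with 4.
val msg = invalidArityMessage("substring", Seq(2, 3), 4)
```

With a message like this, a bad-quote-escaping mistake that silently changes the argument count becomes immediately visible in the exception text.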
[jira] [Commented] (SPARK-22624) Expose range partitioning shuffle introduced by SPARK-22614
[ https://issues.apache.org/jira/browse/SPARK-22624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326707#comment-16326707 ] xubo245 commented on SPARK-22624: - [~smilegator] ok, I will finish it. > Expose range partitioning shuffle introduced by SPARK-22614 > --- > > Key: SPARK-22624 > URL: https://issues.apache.org/jira/browse/SPARK-22624 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu >Priority: Major >
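For context, what a range partitioning shuffle does can be sketched in plain Python. This is an illustrative toy under simplifying assumptions, not Spark's implementation: split points are derived from a sample of the keys, and each record goes to the partition whose key range contains it, so partitions are non-overlapping and globally ordered.

```python
# Toy range partitioner: sample-derived boundaries + binary search.
# range_boundaries and range_partition are made-up names for illustration.
import bisect

def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sorted key sample."""
    s = sorted(sample)
    step = len(s) / num_partitions
    return [s[int(step * i)] for i in range(1, num_partitions)]

def range_partition(records, boundaries):
    """Route each record to the partition whose range covers its key."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        parts[bisect.bisect_right(boundaries, r)].append(r)
    return parts
```

Because partition ranges never overlap, concatenating the sorted partitions yields a globally sorted dataset, which is why this shuffle is worth exposing directly.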
[jira] [Commented] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326685#comment-16326685 ] zhoukang commented on SPARK-23076: -- Yes, maybe it should be an improvement, since I suppose others may have the same requirement. The original reason we cache the MapPartitionsRDD is that we estimate the result size in SparkExecuteStatementOperation#execute, so we cache the result RDD first. We estimate the result size to avoid a Thrift server OOM error. [~cloud_fan] > When we call cache() on RDD which depends on ShuffleRowRDD, we will get an > error result > --- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For the query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > +----------+------+ > |_c0|_c1| > +----------+------+ > |Joe|20| > |Tom|30| > |Hyukjin|25| > +----------+------+ > However, when we call cache on the MapPartitionsRDD below: > !shufflerowrdd-cache.png!
> Then the result will be wrong: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > +----------+------+ > |_c0|_c1| > +----------+------+ > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > +----------+------+ > The reason this happens is that the UnsafeRows generated by > ShuffleRowRDD#compute share the same underlying byte buffer. > I printed some logs below to explain this, after modifying UnsafeRow.toString(): > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > We can fix this by adding a config and copying each UnsafeRow read by the ShuffleRowRDD > iterator when the config is true, like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > {code}
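The buffer-reuse bug described above can be reproduced in miniature without Spark. In this plain-Python sketch (the `ReusedRow` class is made up for illustration), an iterator that keeps mutating one shared row object stands in for ShuffleRowRDD's UnsafeRows sharing one byte buffer; the remedy is the same defensive `copy()` the proposed patch applies.

```python
# One shared mutable row object models UnsafeRows backed by one byte
# buffer. Caching references preserves only the last value; copying each
# row as it is read preserves them all.

class ReusedRow:
    def __init__(self):
        self.values = None

    def set(self, values):
        self.values = values
        return self

    def copy(self):
        fresh = ReusedRow()
        fresh.values = list(self.values)
        return fresh

def read_rows(data):
    row = ReusedRow()          # single shared object, like one byte buffer
    for values in data:
        yield row.set(values)  # same object, new contents on every step

data = [["Joe", 20], ["Tom", 30], ["Hyukjin", 25]]

wrong = list(read_rows(data))                # three references to one row
right = [r.copy() for r in read_rows(data)]  # defensive copy, as in the fix
```

After iteration, every element of `wrong` shows the last row read, which mirrors the three identical "Hyukjin" rows in the report, while `right` keeps all three distinct rows.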
[jira] [Updated] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23076: - Issue Type: Improvement (was: Bug)
[jira] [Commented] (SPARK-20120) spark-sql CLI support silent mode
[ https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326673#comment-16326673 ] Apache Spark commented on SPARK-20120: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20274 > spark-sql CLI support silent mode > - > > Key: SPARK-20120 > URL: https://issues.apache.org/jira/browse/SPARK-20120 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.2.0 > > > It is similar to Hive's silent mode: only the query result is shown. See: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throws TempTableAlreadyExistsException and outputs "Temporary table '$table' already exists" when we create a temp view using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} No need to fix: warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There is a warning when running test("rename temporary view - destination table with database name"): 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead Other test cases also have this warning. was: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } No need to fix: warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view{code} There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } No need to fix: warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view{code} There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning was: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... 
is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > > Problem: it throw TempTableAlreadyExistsException and output "Temporary table > '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > No need to fix: > warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view{code} > There are warning when run test: test("rename temporary view - destination > table with database name") > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > other test cases also have this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning was: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... 
is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > > Problem: it throw TempTableAlreadyExistsException and output "Temporary table > '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} > *No need to fix: > * warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > other test cases also have this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning was: Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") {code:java} 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead {code} other test cases also have this warning Another problem, it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. 
{code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > > Problem: it throw TempTableAlreadyExistsException and output "Temporary table > '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} > *No need to fix: * > warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... 
instead > other test cases also have this warning
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Summary: Fix improper information of TempTableAlreadyExistsException (was: Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view) > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > {code:java} > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > {code} > other test cases also have this warning > Another problem, it throw TempTableAlreadyExistsException and output > "Temporary table '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
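A minimal sketch of the message fix this ticket tracks, in plain Python with made-up names (mirroring the shape of the `create()` method quoted in the description, not Spark's actual classes): when a temporary *view* already exists, the exception should say "view" rather than "Temporary table".

```python
# Made-up exception and registry for illustration; the point is only
# that the error message names a view, not a table.

class TempViewAlreadyExistsError(Exception):
    def __init__(self, name):
        super().__init__(f"Temporary view '{name}' already exists")

view_definitions = {}

def create(name, definition, override_if_exists=False):
    """Register a view, or fail with a view-specific message."""
    if not override_if_exists and name in view_definitions:
        raise TempViewAlreadyExistsError(name)
    view_definitions[name] = definition
```

The behavioral contract is unchanged; only the wording of the error improves, which is all the ticket asks for.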
[jira] [Commented] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326663#comment-16326663 ] Apache Spark commented on SPARK-23000: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/20273 > Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3 > - > > Key: SPARK-23000 > URL: https://issues.apache.org/jira/browse/SPARK-23000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/ > The DataSourceWithHiveMetastoreCatalogSuite test suite on branch-2.3 always > fails with Hadoop 2.6
[jira] [Updated] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23084: Description: Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) to PySpark. Also update the rangeBetween API {noformat} /** * Window function: returns the special frame boundary that represents the first row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedPreceding(): Column = Column(UnboundedPreceding) /** * Window function: returns the special frame boundary that represents the last row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedFollowing(): Column = Column(UnboundedFollowing) /** * Window function: returns the special frame boundary that represents the current row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def currentRow(): Column = Column(CurrentRow) {noformat} was: Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) to PySpark {noformat} /** * Window function: returns the special frame boundary that represents the first row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedPreceding(): Column = Column(UnboundedPreceding) /** * Window function: returns the special frame boundary that represents the last row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedFollowing(): Column = Column(UnboundedFollowing) /** * Window function: returns the special frame boundary that represents the current row in the * window partition. 
* * @group window_funcs * @since 2.3.0 */ def currentRow(): Column = Column(CurrentRow) {noformat} > Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark > --- > > Key: SPARK-23084 > URL: https://issues.apache.org/jira/browse/SPARK-23084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) > to PySpark. Also update the rangeBetween API > {noformat} > /** > * Window function: returns the special frame boundary that represents the > first row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedPreceding(): Column = Column(UnboundedPreceding) > /** > * Window function: returns the special frame boundary that represents the > last row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedFollowing(): Column = Column(UnboundedFollowing) > /** > * Window function: returns the special frame boundary that represents the > current row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def currentRow(): Column = Column(CurrentRow) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
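To illustrate what the new boundary helpers denote, here is a plain-Python sketch of a rows-between window frame (an illustrative model, not the PySpark API): with `unboundedPreceding()` as the frame start and `currentRow()` as the end, each row's aggregate covers everything from the start of the partition through the row itself, i.e. a running total.

```python
# Sentinel values model the special frame boundaries; frame_sum is a
# made-up helper that aggregates over a rows-between frame.

UNBOUNDED_PRECEDING = float("-inf")
UNBOUNDED_FOLLOWING = float("inf")
CURRENT_ROW = 0

def frame_sum(values, start, end):
    """Sum each row's frame; start/end are offsets relative to the row."""
    out = []
    for i in range(len(values)):
        lo = 0 if start == UNBOUNDED_PRECEDING else max(0, i + int(start))
        hi = (len(values) - 1 if end == UNBOUNDED_FOLLOWING
              else min(len(values) - 1, i + int(end)))
        out.append(sum(values[lo:hi + 1]))
    return out
```

Replacing the sentinel with a concrete offset (e.g. start of -1) shrinks the frame to a sliding window, which is why named boundaries and plain integers share one API.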
[jira] [Updated] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23084: Description: Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) to PySpark {noformat} /** * Window function: returns the special frame boundary that represents the first row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedPreceding(): Column = Column(UnboundedPreceding) /** * Window function: returns the special frame boundary that represents the last row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedFollowing(): Column = Column(UnboundedFollowing) /** * Window function: returns the special frame boundary that represents the current row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def currentRow(): Column = Column(CurrentRow) {noformat} > Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark > --- > > Key: SPARK-23084 > URL: https://issues.apache.org/jira/browse/SPARK-23084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 > Environment: Add the new APIs (introduced by > https://github.com/apache/spark/pull/18814) to PySpark > {noformat} > /** > * Window function: returns the special frame boundary that represents the > first row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedPreceding(): Column = Column(UnboundedPreceding) > /** > * Window function: returns the special frame boundary that represents the > last row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedFollowing(): Column = Column(UnboundedFollowing) > /** > * Window function: returns the special frame boundary that represents the > current row in the > * window partition. 
> * > * @group window_funcs > * @since 2.3.0 > */ > def currentRow(): Column = Column(CurrentRow) > {noformat} >Reporter: Xiao Li >Priority: Major > > Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) > to PySpark > {noformat} > /** > * Window function: returns the special frame boundary that represents the > first row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedPreceding(): Column = Column(UnboundedPreceding) > /** > * Window function: returns the special frame boundary that represents the > last row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedFollowing(): Column = Column(UnboundedFollowing) > /** > * Window function: returns the special frame boundary that represents the > current row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def currentRow(): Column = Column(CurrentRow) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23084: Environment: (was: the same text as the Description above, which had been pasted into the Environment field by mistake)
> Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
> ---
>
> Key: SPARK-23084
> URL: https://issues.apache.org/jira/browse/SPARK-23084
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Priority: Major
[jira] [Created] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
Xiao Li created SPARK-23084: --- Summary: Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark Key: SPARK-23084 URL: https://issues.apache.org/jira/browse/SPARK-23084 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0 Environment: (the issue description quoted above, pasted into this field by mistake) Reporter: Xiao Li
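The semantics the three boundaries describe can be shown without any Spark dependency. Below is a pure-Python sketch (the helper `framed_sums` and its string-valued boundary arguments are illustrative only, not part of the proposed PySpark API): for each row of an ordered partition, aggregate over the frame running from `start` to `end`.

```python
# Illustrative only: window frame boundaries as plain list slicing, no Spark involved.
def framed_sums(partition, start, end):
    """Sum over the frame [start, end] for every row of an ordered partition.

    start: "unboundedPreceding" or "currentRow"
    end:   "currentRow" or "unboundedFollowing"
    """
    n = len(partition)
    out = []
    for i in range(n):
        lo = 0 if start == "unboundedPreceding" else i       # frame start for row i
        hi = n - 1 if end == "unboundedFollowing" else i     # frame end for row i
        out.append(sum(partition[lo:hi + 1]))
    return out

vals = [1, 2, 3, 4]
print(framed_sums(vals, "unboundedPreceding", "currentRow"))          # [1, 3, 6, 10]
print(framed_sums(vals, "currentRow", "unboundedFollowing"))          # [10, 9, 7, 4]
print(framed_sums(vals, "unboundedPreceding", "unboundedFollowing"))  # [10, 10, 10, 10]
```

In Spark itself these boundaries are specified per window (e.g. `Window.orderBy(...).rowsBetween(start, end)`); the ticket proposes first-class Column-returning functions for naming them.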
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326635#comment-16326635 ] Sean Owen commented on SPARK-23083: --- Yes that's fine. If it only makes sense after the 2.3 release then we'll wait to merge the PR. > Adding Kubernetes as an option to https://spark.apache.org/ > --- > > Key: SPARK-23083 > URL: https://issues.apache.org/jira/browse/SPARK-23083 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > [https://spark.apache.org/] can now include a reference to, and the k8s logo. > I think this is not tied to the docs. > cc/ [~rxin] [~sameer] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326631#comment-16326631 ] Anirudh Ramanathan commented on SPARK-23083: Thanks. I'll create a PR against that repo. Since it's separate, it can be merged and updated right around the time the release goes out?
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326619#comment-16326619 ] Reynold Xin commented on SPARK-23083: - Here's the website repo: [https://github.com/apache/spark-website]
[jira] [Created] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
Anirudh Ramanathan created SPARK-23083: -- Summary: Adding Kubernetes as an option to https://spark.apache.org/ Key: SPARK-23083 URL: https://issues.apache.org/jira/browse/SPARK-23083 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 2.3.0 Reporter: Anirudh Ramanathan [https://spark.apache.org/] can now include a reference to Kubernetes, along with the k8s logo. I think this is not tied to the docs. cc/ [~rxin] [~sameer]
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326521#comment-16326521 ] Apache Spark commented on SPARK-23078: -- User 'ozzieba' has created a pull request for this issue: https://github.com/apache/spark/pull/20272 > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23078: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23078: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326489#comment-16326489 ] Ruslan Dautkhanov commented on SPARK-23074: --- Yep, there are use cases where ordering is provided, like reading files. We run production jobs that require going back to the RDD API just to do zipWithIndex, which is not as straightforward as it could be if there were a DataFrame-level API for zipWithIndex(), and not as performant. That SO answer got almost 30 upvotes in 2 years, so I know we're not alone and it could benefit many others. Thanks.
> Dataframe-ified zipwithindex
> ---
>
> Key: SPARK-23074
> URL: https://issues.apache.org/jira/browse/SPARK-23074
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Ruslan Dautkhanov
> Priority: Minor
> Labels: dataframe, rdd
>
> Would be great to have a dataframe-friendly equivalent of rdd.zipWithIndex():
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types.{LongType, StructField, StructType}
> import org.apache.spark.sql.Row
>
> def dfZipWithIndex(
>     df: DataFrame,
>     offset: Int = 1,
>     colName: String = "id",
>     inFront: Boolean = true
> ): DataFrame = {
>   df.sqlContext.createDataFrame(
>     df.rdd.zipWithIndex.map(ln =>
>       Row.fromSeq(
>         (if (inFront) Seq(ln._2 + offset) else Seq())
>           ++ ln._1.toSeq ++
>           (if (inFront) Seq() else Seq(ln._2 + offset))
>       )
>     ),
>     StructType(
>       (if (inFront) Array(StructField(colName, LongType, false)) else Array[StructField]())
>         ++ df.schema.fields ++
>         (if (inFront) Array[StructField]() else Array(StructField(colName, LongType, false)))
>     )
>   )
> }
> {code}
> credits: [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex]
[jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326479#comment-16326479 ] Sean Owen commented on SPARK-23074: --- Hm, rowNumber requires you to sort the input? I didn't think it did, semantically. The numbering isn't so meaningful unless the input has a defined ordering, sure, but the same is true of an RDD. Unless you sort it, the indexing could change when it's evaluated again. You're not really guaranteed what order you see the data, although in practice, like in your example, you will get data from things like files in the order you expect.
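The behaviour dfZipWithIndex is after can be mirrored in plain Python, which also makes the ordering caveat concrete: the index simply records encounter order. The helper below (`zip_with_index`, hypothetical, not a Spark API) reproduces the `offset`/`inFront` options:

```python
# Hypothetical pure-Python mirror of dfZipWithIndex: number rows in encounter
# order and attach the index in front of or behind the existing columns.
def zip_with_index(rows, offset=1, in_front=True):
    return [
        (i + offset,) + tuple(row) if in_front else tuple(row) + (i + offset,)
        for i, row in enumerate(rows)
    ]

rows = [("a", 10), ("b", 20), ("c", 30)]
print(zip_with_index(rows))                  # [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30)]
print(zip_with_index(rows, in_front=False))  # [('a', 10, 1), ('b', 20, 2), ('c', 30, 3)]
```

As with RDD.zipWithIndex, the numbering is only as meaningful as the order in which the rows arrive; reshuffle `rows` and the indices reshuffle with them, which is the point raised in the comment above.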
[jira] [Updated] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oz Ben-Ami updated SPARK-23082: --- Description: In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark driver to a different set of nodes from its executors. In Kubernetes, we can specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use separate options for the driver and executors. This would be useful for the particular use case where executors can go on more ephemeral nodes (e.g., with cluster autoscaling, or preemptible/spot instances), but the driver should use a more persistent machine. The required change would be minimal, essentially just using different config keys for the [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] and [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both. (was: the same text with plain formatting and bare links)
> Allow separate node selectors for driver and executors in Kubernetes
> ---
>
> Key: SPARK-23082
> URL: https://issues.apache.org/jira/browse/SPARK-23082
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Spark Submit
> Affects Versions: 2.2.0, 2.3.0
> Reporter: Oz Ben-Ami
> Priority: Minor
[jira] [Updated] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oz Ben-Ami updated SPARK-23082: --- Description: (reflowed; same text as the creation entry below)
> Allow separate node selectors for driver and executors in Kubernetes
> ---
>
> Key: SPARK-23082
> URL: https://issues.apache.org/jira/browse/SPARK-23082
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Spark Submit
> Affects Versions: 2.2.0, 2.3.0
> Reporter: Oz Ben-Ami
> Priority: Minor
[jira] [Created] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
Oz Ben-Ami created SPARK-23082: -- Summary: Allow separate node selectors for driver and executors in Kubernetes Key: SPARK-23082 URL: https://issues.apache.org/jira/browse/SPARK-23082 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Submit Affects Versions: 2.2.0, 2.3.0 Reporter: Oz Ben-Ami In YARN, we can use spark.yarn.am.nodeLabelExpression to submit the Spark driver to a different set of nodes from its executors. In Kubernetes, we can specify spark.kubernetes.node.selector.[labelKey], but we can't use separate options for the driver and executors. This would be useful for the particular use case where executors can go on more ephemeral nodes (e.g., with cluster autoscaling, or preemptible/spot instances), but the driver should use a more persistent machine. The required change would be minimal, essentially just using different config keys in [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] and [https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] instead of KUBERNETES_NODE_SELECTOR_PREFIX for both.
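A minimal sketch of the lookup rule such a change might use, in plain Python. The role-specific prefixes (`spark.kubernetes.driver.node.selector.*`, `spark.kubernetes.executor.node.selector.*`) and the fall-back-to-shared-prefix behaviour are assumptions for illustration, not settled API:

```python
# Assumed key shapes for illustration; only the shared prefix exists today.
SHARED_PREFIX = "spark.kubernetes.node.selector."

def node_selectors(conf, role):
    """Resolve node-selector labels for role ('driver' or 'executor').

    Role-specific keys win when any are present; otherwise fall back to the
    shared prefix. Merging instead of replacing would be an alternative design.
    """
    role_prefix = "spark.kubernetes.%s.node.selector." % role  # hypothetical key shape
    prefix = role_prefix if any(k.startswith(role_prefix) for k in conf) else SHARED_PREFIX
    return {k[len(prefix):]: v for k, v in conf.items() if k.startswith(prefix)}

conf = {
    "spark.kubernetes.driver.node.selector.pool": "stable",
    "spark.kubernetes.node.selector.zone": "us-east1-b",
}
print(node_selectors(conf, "driver"))    # {'pool': 'stable'}
print(node_selectors(conf, "executor"))  # {'zone': 'us-east1-b'}
```

With a config like the above, the driver would land on the persistent "stable" pool while executors keep the shared (ephemeral-friendly) selector.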
[jira] [Resolved] (SPARK-21108) convert LinearSVC to aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21108. Resolution: Fixed > convert LinearSVC to aggregator framework > - > > Key: SPARK-21108 > URL: https://issues.apache.org/jira/browse/SPARK-21108 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326459#comment-16326459 ] Bryan Cutler commented on SPARK-12717: -- Hi [~codlife], you can use Spark 2.2.1, which was released in December, or the upcoming 2.3.0 release; both have this fix.
> pyspark broadcast fails when using multiple threads
> ---
>
> Key: SPARK-12717
> URL: https://issues.apache.org/jira/browse/SPARK-12717
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.0
> Environment: Linux, python 2.6 or python 2.7.
> Reporter: Edward Walker
> Assignee: Bryan Cutler
> Priority: Critical
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
> Attachments: run.log
>
> The following multi-threaded program that uses broadcast variables consistently throws exceptions like: *Exception("Broadcast variable '18' not loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
>
> from optparse import OptionParser
>
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
>
> def do_process(x, w):
>     return x * w.value
>
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
>
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
>     threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism)]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output:
>
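For contrast, the fan-out pattern in the repro above can be stripped of Spark entirely: each thread combines shared read-only data with its own per-thread value. This is only a sketch of the access pattern, not the actual fix; it shows that the pattern itself is sound, and that the reported "Broadcast variable not loaded" failures came from Spark's broadcast handling rather than from using threads this way.

```python
# Plain-Python analogue of the multi-threaded repro: no SparkContext, no
# broadcast variables, just threads sharing read-only data.
import threading

data = list(range(1000))
results = {}  # each thread writes its own key, so no lock is needed

def func(name, shared, conf_value):
    total = sum(x * conf_value for x in shared)
    results[name] = total / float(len(shared))

threads = [
    threading.Thread(target=func, args=("thread_%d" % i, data, i))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items()))
# [('thread_0', 0.0), ('thread_1', 499.5), ('thread_2', 999.0), ('thread_3', 1498.5)]
```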
[jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326455#comment-16326455 ] Ruslan Dautkhanov commented on SPARK-23074: --- {quote}You can create a DataFrame from the result of .zipWithIndex on an RDD, as you see here.{quote} That's very kludgy, as you can see from the code snippet above. Direct functionality for this RDD-level API would be super awesome to have. {quote}There's already a rowNumber function in Spark SQL, however, which sounds like the native equivalent?{quote} That's not the same. rowNumber requires an expression to `order by` on. What if there is no such column? For example, we often have files that we ingest into Spark where the physical position of a record determines how that record has to be processed. I can give an exact example if necessary. zipWithIndex() is actually the only API call that preserves such information from the original source (even though it can be distributed into multiple partitions etc.). Also, as folks in that stackoverflow question said, the rowNumber approach is way slower (and that's not surprising, as it requires data sorting).
[jira] [Comment Edited] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326455#comment-16326455 ] Ruslan Dautkhanov edited comment on SPARK-23074 at 1/15/18 5:40 PM: {quote}You can create a DataFrame from the result of .zipWithIndex on an RDD, as you see here. {quote} That's very kludgy as you can see the code snippet above. A direct functionality for this RDD-level api would be super awesome to have. {quote}There's already a rowNumber function in Spark SQL, however, which sounds like the native equivalent? {quote} That's not the same. rowNumber requires an expression to `order by` on. What if there is no such column? For example, we often have files that we ingest into Spark and where we physical position of a record is meaningful how that record has to be processed. I can give exact example if you're interested. zipWithIndex() actually the only one API call that preserves such information from original source (even though it can be distributed into multiple partitions etc.). Also as folks in that stackoverflow question said, rowNumber approach is way slower (and it's not surprizing as it requires data sorting). was (Author: tagar): {quote}You can create a DataFrame from the result of .zipWithIndex on an RDD, as you see here.{quote} That's very kludgy as you can see the code snippet above. A direct functionality for this RDD-level api would be super awesome to have. {quote}There's already a rowNumber function in Spark SQL, however, which sounds like the native equivalent?{quote} That's not the same. rowNumber requires an expression to `order by` on. What if there is no such column? For example, we often have files that we ingest into Spark and where we physical position of a record is meaningful how that record has to be processed. I can give exact example if you're necessary. 
zipWithIndex() is actually the only API call that preserves such information from the original source (even though it can be distributed into multiple partitions etc.). Also, as folks in that stackoverflow question said, the rowNumber approach is way slower (and that's not surprising, as it requires sorting the data). > Dataframe-ified zipwithindex > > > Key: SPARK-23074 > URL: https://issues.apache.org/jira/browse/SPARK-23074 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Ruslan Dautkhanov >Priority: Minor > Labels: dataframe, rdd > > Would be great to have a dataframe-friendly equivalent of rdd.zipWithIndex(): > {code:java} > import org.apache.spark.sql.DataFrame > import org.apache.spark.sql.types.{LongType, StructField, StructType} > import org.apache.spark.sql.Row > def dfZipWithIndex( > df: DataFrame, > offset: Int = 1, > colName: String = "id", > inFront: Boolean = true > ) : DataFrame = { > df.sqlContext.createDataFrame( > df.rdd.zipWithIndex.map(ln => > Row.fromSeq( > (if (inFront) Seq(ln._2 + offset) else Seq()) > ++ ln._1.toSeq ++ > (if (inFront) Seq() else Seq(ln._2 + offset)) > ) > ), > StructType( > (if (inFront) Array(StructField(colName,LongType,false)) else > Array[StructField]()) > ++ df.schema.fields ++ > (if (inFront) Array[StructField]() else > Array(StructField(colName,LongType,false))) > ) > ) > } > {code} > credits: > [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
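For readers who want the shape of the transformation without spinning up Spark, here is a plain-Python stand-in (illustrative only, not Spark code) for what {{dfZipWithIndex}} does to each row: pair every row with its global position and prepend or append that position, plus an offset, as a new column.

```python
# Plain-Python sketch of the dfZipWithIndex semantics (not Spark code).
# zipWithIndex pairs each row with its global position; the helper then
# either prepends or appends that position (plus an offset) as a column.
def df_zip_with_index(rows, offset=1, in_front=True):
    """rows: list of tuples standing in for DataFrame rows."""
    out = []
    for idx, row in enumerate(rows):
        if in_front:
            out.append((idx + offset,) + tuple(row))
        else:
            out.append(tuple(row) + (idx + offset,))
    return out

rows = [("a",), ("b",), ("c",)]
print(df_zip_with_index(rows))  # → [(1, 'a'), (2, 'b'), (3, 'c')]
```

Unlike rowNumber, nothing here needs an order-by column: the index simply reflects the original record positions.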
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326422#comment-16326422 ] Roque Vassal'lo commented on SPARK-6305: Hi, I would like to know the current status of this ticket. Log4j 1 reached EOL two years ago, and the current log4j 2 version (2.10) is compatible with log4j 1 properties files (which, as I have read in this thread, is a must). I have run some tests using log4j's bridge and slf4j, and it seems to work fine and does not need the properties files to be modified. Would this be a good approach to keep compatibility while getting new log4j 2 functionality and bug fixes? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23081) Add colRegex API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23081: Summary: Add colRegex API to PySpark (was: Add colRegex to PySpark) > Add colRegex API to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23081) Add colRegex to PySpark
Xiao Li created SPARK-23081: --- Summary: Add colRegex to PySpark Key: SPARK-23081 URL: https://issues.apache.org/jira/browse/SPARK-23081 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0 Reporter: Xiao Li Fix For: 2.3.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23081) Add colRegex to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23081: Fix Version/s: (was: 2.3.0) > Add colRegex to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-12139: --- Assignee: jane > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Assignee: jane >Priority: Minor > Fix For: 2.3.0 > > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive would interpret this query as a regular expression, which can be > supported in the hive parser for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326390#comment-16326390 ] Oz Ben-Ami commented on SPARK-23078: [~mgaido] no objections, Kubernetes is the one that most needs it anyway. > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-12139. - Resolution: Fixed Fix Version/s: 2.3.0 > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Priority: Minor > Fix For: 2.3.0 > > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive would interpret this query as a regular expression, which can be > supported in the hive parser for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
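For illustration, the regex column specification this resolves behaves roughly like the following plain-Python sketch: a backquoted pattern in the select list is matched against the table's column names. Using {{re.fullmatch}} here is an assumption made for the sketch, not a claim about Hive's or Spark's exact matching rules.

```python
import re

def select_regex_columns(columns, pattern):
    """Return the column names that fully match a backquoted regex,
    mimicking Hive's SELECT `pattern` FROM t column specification."""
    rx = re.compile(pattern)
    # Preserve the table's column order, as a SELECT * would.
    return [c for c in columns if rx.fullmatch(c)]

print(select_regex_columns(["key", "k1", "value"], "k.*"))  # → ['key', 'k1']
```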
[jira] [Assigned] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23080: Assignee: (was: Apache Spark) > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > When a user puts the wrong number of parameters in a function, an > AnalysisException is thrown. If the function is a UDF, the user is told how > many parameters the function expected and how many were supplied. If the > function is instead a built-in one, no information about the expected and > actual number of parameters is provided. Providing it would help in some > cases to debug errors (e.g. bad quote escaping may lead to a different > number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23080: Assignee: Apache Spark > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Trivial > > When a user puts the wrong number of parameters in a function, an > AnalysisException is thrown. If the function is a UDF, the user is told how > many parameters the function expected and how many were supplied. If the > function is instead a built-in one, no information about the expected and > actual number of parameters is provided. Providing it would help in some > cases to debug errors (e.g. bad quote escaping may lead to a different > number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326383#comment-16326383 ] Apache Spark commented on SPARK-23080: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20271 > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > When a user puts the wrong number of parameters in a function, an > AnalysisException is thrown. If the function is a UDF, the user is told how > many parameters the function expected and how many were supplied. If the > function is instead a built-in one, no information about the expected and > actual number of parameters is provided. Providing it would help in some > cases to debug errors (e.g. bad quote escaping may lead to a different > number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
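The improvement proposed in SPARK-23080 amounts to reporting expected vs. actual arity for built-in functions the way UDF errors already do. A minimal plain-Python sketch of that idea follows; the registry, the chosen function name, and the exact message wording are hypothetical, not Spark's actual classes or messages.

```python
class AnalysisException(Exception):
    pass

# Hypothetical registry mapping a built-in function name to its expected
# arity; a stand-in for the real function registry's signature metadata.
BUILTINS = {"substring": 3}

def check_arity(name, args):
    """Raise an AnalysisException that names both the expected and
    the actual parameter count, as the improvement proposes."""
    expected = BUILTINS.get(name)
    if expected is not None and len(args) != expected:
        raise AnalysisException(
            f"Invalid number of arguments for function {name}: "
            f"expected {expected}, found {len(args)}")

check_arity("substring", ["s", 1, 2])  # correct arity: no error raised
```

With a message like this, the bad-quote-escaping case from the description becomes immediately visible as a mismatched argument count.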
[jira] [Commented] (SPARK-22624) Expose range partitioning shuffle introduced by SPARK-22614
[ https://issues.apache.org/jira/browse/SPARK-22624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326382#comment-16326382 ] Xiao Li commented on SPARK-22624: - cc [~xubo245] Do you want to take this? > Expose range partitioning shuffle introduced by SPARK-22614 > --- > > Key: SPARK-22624 > URL: https://issues.apache.org/jira/browse/SPARK-22624 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326378#comment-16326378 ] Marco Gaido commented on SPARK-23078: - [~ozzieba] I see that in Kubernetes it might work, but I think it makes no sense on any other RM. The best option would therefore be to allow it only on Kubernetes. > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326366#comment-16326366 ] Oz Ben-Ami edited comment on SPARK-23078 at 1/15/18 4:02 PM: - [~mgaido] In Kubernetes you can just create a Service (think in-cluster DNS+load balancing) which automatically connects to it, whichever node it's on was (Author: ozzieba): [~mgaido] In Kubernetes you can just create a Service which automatically connects to it, whichever node it's on > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326366#comment-16326366 ] Oz Ben-Ami commented on SPARK-23078: [~mgaido] In Kubernetes you can just create a Service which automatically connects to it, whichever node it's on > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23080) Improve error message for built-in functions
Marco Gaido created SPARK-23080: --- Summary: Improve error message for built-in functions Key: SPARK-23080 URL: https://issues.apache.org/jira/browse/SPARK-23080 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Marco Gaido When a user puts the wrong number of parameters in a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is instead a built-in one, no information about the expected and actual number of parameters is provided. Providing it would help in some cases to debug errors (e.g. bad quote escaping may lead to a different number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.
[ https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326330#comment-16326330 ] Steve Loughran edited comment on SPARK-23050 at 1/15/18 3:24 PM: - Quick review of the code Yes, there's potentially a failure if a cached 404 is picked up in taskCommit. It'd be slightly less brittle to return the array of URIs in the task commit message and have {{commitJob}} call getFileStatus() for each. That'd eliminate the problem except for any task committed immediately before commitJob & whose ref was still in the negative cache of the S3 load balancers. It would also help catch the potential issue "file is lost between task commit and job commit". Even so, it'd be safer to do a little retry, a bit like ScalaTest's {{eventually()}}, to deal with that negative caching. It *should* only last a few seconds at worst (we don't have any real figures on it; it's so rarely seen, at least with the s3a client). Following the commitJob code path, {{HDFSMetadataLog}} could be made object-store aware and opt for a direct (atomic) overwrite of the log, rather than the write to temp & rename. Without that, time to commit becomes O(files) rather than O(1). Oh, and the issue is the map in the commit job {code:java} val statuses: Seq[SinkFileStatus] = addedFiles.map(f => SinkFileStatus(fs.getFileStatus(new Path(f {code} Probing S3 for a single file can take a few hundred millis, as it's a single HEAD for a file, [more for a directory|https://steveloughran.blogspot.co.uk/2016/12/how-long-does-filesystemexists-take.html]. And the more queries you make, the higher the risk of S3 throttling. Doing that in jobCommit amplifies the risk of throttling. More efficient would be to do a single list of the task working dir into a map, then use that to look up the files: approximately O(1) per directory, though as LIST consistency lags create consistency, it's actually less reliable. 
Finally, I believe ORC may not write a file if there's nothing to write, so the file not being there isn't necessarily a failure case (implication: any busy-wait for a file coming into existence should not wait that long) +[~fabbri], [~Thomas Demoor], [~ehiggs] was (Author: ste...@apache.org): Quick review of the code Yes, there's potentially a failure if a cached 404 is picked up in taskCommit. It'd be slightly less brittle to return the array of URIs in the task commit message and have {{commitJob}} call getFileStatus() for each. That'd eliminate the problem except for any task committed immediately before commitJob & whose ref was still in the negative cache of the S3 load balancers. It would also help catch the potential issue "file is lost between task commit and job commit". Even so, it'd be safer to do a little retry, a bit like ScalaTest's {{eventually()}}, to deal with that negative caching. It *should* only last a few seconds at worst (we don't have any real figures on it; it's so rarely seen, at least with the s3a client). Following the commitJob code path, {{HDFSMetadataLog}} could be made object-store aware and opt for a direct (atomic) overwrite of the log, rather than the write to temp & rename. Without that, time to commit becomes O(files) rather than O(1). > Structured Streaming with S3 file source duplicates data because of eventual > consistency. > - > > Key: SPARK-23050 > URL: https://issues.apache.org/jira/browse/SPARK-23050 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yash Sharma >Priority: Major > > Spark Structured streaming with S3 file source duplicates data because of > eventual consistency. > Reproducing the scenario - > - Structured streaming reading from S3 source. Writing back to S3. > - Spark tries to commitTask on completion of a task, by verifying if all the > files have been written to Filesystem. > {{ManifestFileCommitProtocol.commitTask}}. 
> - [Eventual consistency issue] Spark finds that the file is not present and > fails the task. {{org.apache.spark.SparkException: Task failed while writing > rows. No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}} > - By this time S3 eventually gets the file. > - Spark reruns the task and completes the task, but gets a new file name this > time. {{ManifestFileCommitProtocol.newTaskTempFile. > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}} > - Data duplicates in results and the same data is processed twice and written > to S3. > - There is no data duplication if spark is able to list presence of all > committed files and all tasks succeed. >
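The {{eventually()}}-style retry suggested in the comment above can be sketched in plain Python: retry a probe until it stops raising or a deadline passes, riding out a short negative-caching window such as a cached S3 404. The timeout and interval values below are illustrative, not recommendations.

```python
import time

def eventually(probe, timeout=10.0, interval=0.5):
    """Retry probe() until it stops raising or the timeout expires,
    mimicking ScalaTest's eventually() for a transient negative cache."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return probe()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # the window never closed; surface the real error
            time.sleep(interval)

# Usage sketch: a stand-in for a getFileStatus() call that briefly 404s.
attempts = {"n": 0}
def probe():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FileNotFoundError("cached 404")
    return "FileStatus(...)"

print(eventually(probe, timeout=5, interval=0.01))
```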
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326337#comment-16326337 ] Marco Gaido commented on SPARK-23078: - The problem is: in cluster mode you don't control where the thriftserver is launched. So how do you connect to it? I am not sure it makes sense to run it in cluster mode. > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23079) Fix query constraints propagation with aliases
[ https://issues.apache.org/jira/browse/SPARK-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23079: Assignee: (was: Apache Spark) > Fix query constraints propagation with aliases > -- > > Key: SPARK-23079 > URL: https://issues.apache.org/jira/browse/SPARK-23079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Major > > Previously, PR #19201 fixed the problem of non-converging constraints. > After that, PR #19149 improved the loop so that constraints are inferred > only once, and the problem of non-converging constraints is gone. > Also, in the current code, the case below fails. > ``` > spark.range(5).write.saveAsTable("t") > val t = spark.read.table("t") > val left = t.withColumn("xid", $"id" + lit(1)).as("x") > val right = t.withColumnRenamed("id", "xid").as("y") > val df = left.join(right, "xid").filter("id = 3").toDF() > checkAnswer(df, Row(4, 3)) > ``` > This is because `aliasMap` replaces all the aliased children. See the test > case in the PR for details. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23079) Fix query constraints propagation with aliases
[ https://issues.apache.org/jira/browse/SPARK-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326334#comment-16326334 ] Apache Spark commented on SPARK-23079: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/20270 > Fix query constraints propagation with aliases > -- > > Key: SPARK-23079 > URL: https://issues.apache.org/jira/browse/SPARK-23079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Major > > Previously, PR #19201 fixed the problem of non-converging constraints. > After that, PR #19149 improved the loop so that constraints are inferred > only once, and the problem of non-converging constraints is gone. > Also, in the current code, the case below fails. > ``` > spark.range(5).write.saveAsTable("t") > val t = spark.read.table("t") > val left = t.withColumn("xid", $"id" + lit(1)).as("x") > val right = t.withColumnRenamed("id", "xid").as("y") > val df = left.join(right, "xid").filter("id = 3").toDF() > checkAnswer(df, Row(4, 3)) > ``` > This is because `aliasMap` replaces all the aliased children. See the test > case in the PR for details. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23079) Fix query constraints propagation with aliases
[ https://issues.apache.org/jira/browse/SPARK-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23079: Assignee: Apache Spark > Fix query constraints propagation with aliases > -- > > Key: SPARK-23079 > URL: https://issues.apache.org/jira/browse/SPARK-23079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Previously, PR #19201 fixed the problem of non-converging constraints. > After that, PR #19149 improved the loop so that constraints are inferred > only once, and the problem of non-converging constraints is gone. > Also, in the current code, the case below fails. > ``` > spark.range(5).write.saveAsTable("t") > val t = spark.read.table("t") > val left = t.withColumn("xid", $"id" + lit(1)).as("x") > val right = t.withColumnRenamed("id", "xid").as("y") > val df = left.join(right, "xid").filter("id = 3").toDF() > checkAnswer(df, Row(4, 3)) > ``` > This is because `aliasMap` replaces all the aliased children. See the test > case in the PR for details. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23035. - Resolution: Fixed Assignee: xubo245 Fix Version/s: 2.3.0 > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > -- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warnings when running the test: test("rename temporary view - destination > table with database name") > {code:java} > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > {code} > Other test cases also produce this warning. > Another problem: it throws TempTableAlreadyExistsException and outputs > "Temporary table '$table' already exists" when we create a temp view using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
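The quoted {{create}} logic boils down to "fail if the view already exists, unless an override is requested" — and the ticket's point is that the failure should use a view-specific exception and message. A plain-Python stand-in for that semantics (class and method names mirror the Scala but are illustrative, not Spark's API):

```python
class TempViewAlreadyExistsException(Exception):
    """View-specific exception, per the ticket's requested wording."""
    def __init__(self, name):
        super().__init__(f"Temporary view '{name}' already exists")

class GlobalTempViewManager:
    def __init__(self):
        self._views = {}

    def create(self, name, view_definition, override_if_exists=False):
        # Mirrors the Scala: refuse to overwrite unless explicitly allowed.
        if not override_if_exists and name in self._views:
            raise TempViewAlreadyExistsException(name)
        self._views[name] = view_definition

m = GlobalTempViewManager()
m.create("v", "plan1")
m.create("v", "plan2", override_if_exists=True)  # ok: override allowed
```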
[jira] [Commented] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.
[ https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326330#comment-16326330 ] Steve Loughran commented on SPARK-23050: Quick review of the code Yes, there's potentially a failure if a cached 404 is picked up in taskCommit. It'd be slightly less brittle to return the array of URIs in the task commit message and have {{commitJob}} call getFileStatus() for each. That'd eliminate the problem except for any task committed immediately before commitJob & whose ref was still in the negative cache of the S3 load balancers. It would also help catch the potential issue "file is lost between task commit and job commit". Even so, it'd be safer to do a little retry, a bit like ScalaTest's {{eventually()}}, to deal with that negative caching. It *should* only last a few seconds at worst (we don't have any real figures on it; it's so rarely seen, at least with the s3a client). Following the commitJob code path, {{HDFSMetadataLog}} could be made object-store aware and opt for a direct (atomic) overwrite of the log, rather than the write to temp & rename. Without that, time to commit becomes O(files) rather than O(1). > Structured Streaming with S3 file source duplicates data because of eventual > consistency. > - > > Key: SPARK-23050 > URL: https://issues.apache.org/jira/browse/SPARK-23050 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yash Sharma >Priority: Major > > Spark Structured streaming with S3 file source duplicates data because of > eventual consistency. > Reproducing the scenario - > - Structured streaming reading from S3 source. Writing back to S3. > - Spark tries to commitTask on completion of a task, by verifying if all the > files have been written to Filesystem. > {{ManifestFileCommitProtocol.commitTask}}. > - [Eventual consistency issue] Spark finds that the file is not present and > fails the task. 
{{org.apache.spark.SparkException: Task failed while writing > rows. No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}} > - By this time S3 eventually gets the file. > - Spark reruns the task and completes the task, but gets a new file name this > time. {{ManifestFileCommitProtocol.newTaskTempFile. > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}} > - Data duplicates in results and the same data is processed twice and written > to S3. > - There is no data duplication if spark is able to list presence of all > committed files and all tasks succeed. > Code: > {code} > query = selected_df.writeStream \ > .format("parquet") \ > .option("compression", "snappy") \ > .option("path", "s3://path/data/") \ > .option("checkpointLocation", "s3://path/checkpoint/") \ > .start() > {code} > Same sized duplicate S3 Files: > {code} > $ aws s3 ls s3://path/data/ | grep part-00256 > 2018-01-11 03:37:00 17070 > part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet > 2018-01-11 03:37:10 17070 > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet > {code} > Exception on S3 listing and task failure: > {code} > [Stage 5:>(277 + 100) / > 597]18/01/11 03:36:59 WARN TaskSetManager: Lost task 256.0 in stage 5.0 (TID > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816) > at >
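The retry idea from the comment above, polling an existence probe until S3's negative (404) cache entry expires in the spirit of ScalaTest's {{eventually()}}, can be sketched in plain Python. This is only an illustrative model: the store class and function names are hypothetical stand-ins, no real S3 or Hadoop FileSystem calls are made, and `exists` plays the role of a getFileStatus() probe.

```python
import time

def wait_until_visible(exists, path, timeout_s=10.0, interval_s=0.5):
    """Poll an existence check until it succeeds or the deadline passes.

    An eventually()-style retry for riding out negative (404) caching;
    `exists` stands in for a getFileStatus() probe against the store.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

class EventuallyConsistentStore:
    """Simulated store: an object only becomes visible after a few
    probes, like a cached 404 entry expiring on a load balancer."""
    def __init__(self, visible_after_probes):
        self.probes = 0
        self.visible_after = visible_after_probes

    def exists(self, path):
        self.probes += 1
        return self.probes > self.visible_after

store = EventuallyConsistentStore(visible_after_probes=3)
found = wait_until_visible(store.exists, "s3://path/data/part-00256.parquet",
                           timeout_s=5.0, interval_s=0.01)
print(found)  # True: the probe succeeds once the simulated 404 cache expires
```

In a real commitJob, the loop would run once per committed file and rethrow the FileNotFoundException if the deadline passes, so a genuinely lost file still fails the job.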
[jira] [Created] (SPARK-23079) Fix query constraints propagation with aliases
Gengliang Wang created SPARK-23079: -- Summary: Fix query constraints propagation with aliases Key: SPARK-23079 URL: https://issues.apache.org/jira/browse/SPARK-23079 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Gengliang Wang Previously, PR #19201 fixed the problem of non-converging constraints. After that, PR #19149 improved the loop so that constraints are inferred only once, so the problem of non-converging constraints is gone. However, in the current code, the case below will still fail. ``` spark.range(5).write.saveAsTable("t") val t = spark.read.table("t") val left = t.withColumn("xid", $"id" + lit(1)).as("x") val right = t.withColumnRenamed("id", "xid").as("y") val df = left.join(right, "xid").filter("id = 3").toDF() checkAnswer(df, Row(4, 3)) ``` This is because `aliasMap` replaces all the aliased children. See the test case in the PR for details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326304#comment-16326304 ] Wenchen Fan commented on SPARK-23076: - To be clear, you maintain an internal Spark version and cache the MapPartitionsRDD by changing the Spark SQL code base? Then this is not a bug... > When we call cache() on RDD which depends on ShuffleRowRDD, we will get an > error result > --- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! 
> Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
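The root cause described in the report above, a single mutable UnsafeRow backing buffer reused across the iterator, can be modeled in a few lines of plain Python. This is a hedged sketch, not Spark code: the dict stands in for the reused byte buffer, and `dict(row)` plays the role of the proposed `UnsafeRow.copy()` before caching.

```python
# A minimal model of the UnsafeRow reuse problem: the iterator yields
# the *same* mutable row object each time, rewriting it in place, much
# as ShuffleRowRDD#compute reuses one underlying byte buffer.
def reused_row_iterator(values):
    row = {}                      # stands in for the single UnsafeRow buffer
    for name, age in values:
        row["_c0"], row["_c1"] = name, age
        yield row                 # same object every time; only contents change

data = [("Joe", 20), ("Tom", 30), ("Hyukjin", 25)]

# Caching the yielded references keeps three pointers to one object, so
# every cached "row" shows the last value written: the duplicated
# |Hyukjin|25| rows from the bug report.
broken_cache = list(reused_row_iterator(data))
print(broken_cache)               # all three entries show Hyukjin/25

# The proposed fix: copy each row before it is cached.
fixed_cache = [dict(row) for row in reused_row_iterator(data)]
print(fixed_cache)                # three distinct rows, as expected
```

This is why the suggested patch copies each UnsafeRow read from the shuffle iterator before handing it to a cached RDD.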
[jira] [Created] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
Oz Ben-Ami created SPARK-23078: -- Summary: Allow Submitting Spark Thrift Server in Cluster Mode Key: SPARK-23078 URL: https://issues.apache.org/jira/browse/SPARK-23078 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Submit, SQL Affects Versions: 2.2.1, 2.2.0, 2.3.0 Reporter: Oz Ben-Ami Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running in Cluster mode, since at the time it was not able to do so successfully. I have confirmed that Spark Thrift Server can run in Cluster mode on Kubernetes, by commenting out [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] I have not had a chance to test against YARN. Since Kubernetes does not have Client mode, this change is necessary to run the Spark Thrift Server in Kubernetes. [~foxish] [~coderanger]
[jira] [Resolved] (SPARK-23070) Bump previousSparkVersion in MimaBuild.scala to be 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-23070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23070. - Resolution: Fixed > Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 > > > Key: SPARK-23070 > URL: https://issues.apache.org/jira/browse/SPARK-23070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker >
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326303#comment-16326303 ] JP Moresmau commented on SPARK-18112: - I'm using Hive 2.3.2 and Spark 2.2.1, but I still run into this issue. Is there any specific configuration setting I should look for? > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at 
org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22995) Spark UI stdout/stderr links point to executors internal address
[ https://issues.apache.org/jira/browse/SPARK-22995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jhon Cardenas resolved SPARK-22995. --- Resolution: Unresolved > Spark UI stdout/stderr links point to executors internal address > > > Key: SPARK-22995 > URL: https://issues.apache.org/jira/browse/SPARK-22995 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.1 > Environment: AWS EMR, yarn cluster. >Reporter: Jhon Cardenas >Priority: Major > Attachments: link.jpeg > > > On the Spark UI, in the Environment and Executors tabs, the stdout and stderr links point to the internal addresses of the executors. This implies exposing the executors so that the links can be accessed. Shouldn't those links point to the master instead, with the master internally serving as a proxy for these files, rather than exposing the internal machines?
[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23029: Assignee: (was: Apache Spark) > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Priority: Minor > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
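The heap blow-up is consistent with the ticket title: a suffix-less spark.shuffle.file.buffer value is read as KiB, so `$((32*1024))` asks for 32 MiB per buffered stream rather than the intended 32768 bytes. A rough Python model follows; the parser is a simplified, hypothetical stand-in for Spark's real byte-string parsing, and the reduce-partition count is an assumed example, not a figure from the report.

```python
# Simplified stand-in for a byte-string parser where a value with no
# unit suffix defaults to KiB (the behavior this ticket documents).
UNITS = {"": 1024, "b": 1, "k": 1024, "kb": 1024,
         "m": 1024**2, "mb": 1024**2, "g": 1024**3, "gb": 1024**3}

def buffer_size_bytes(setting: str) -> int:
    s = setting.strip().lower()
    digits = s.rstrip("bkmg")          # split trailing unit letters off
    suffix = s[len(digits):]
    return int(digits) * UNITS[suffix]  # suffix-less values default to KiB

assumed = buffer_size_bytes("32768")    # the user meant 32768 bytes...
print(assumed)                          # 33554432, i.e. 32 MiB per buffer

# BypassMergeSortShuffleWriter opens one buffered writer per reduce
# partition; assuming e.g. 200 reduce partitions in one task:
reduce_partitions = 200
print(assumed * reduce_partitions / 2**30)  # 6.25 (GiB of buffers -> OOM)

# The intended spelling with an explicit unit:
print(buffer_size_bytes("32k"))         # 32768
```

Documenting the default unit (or writing `32k` explicitly) avoids the 1024x over-allocation.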
[jira] [Commented] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326260#comment-16326260 ] Apache Spark commented on SPARK-23029: -- User 'ferdonline' has created a pull request for this issue: https://github.com/apache/spark/pull/20269 > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Priority: Minor > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23029: Assignee: Apache Spark > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Apache Spark >Priority: Minor > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: Chunsheng Ji > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Chunsheng Ji >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: (was: Weichen Xu) > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: Weichen Xu > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Resolved] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21856. Resolution: Fixed > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Updated] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23076: - Summary: When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result (was: When we call cache() on ShuffleRowRDD, we will get an error result) > When we call cache() on RDD which depends on ShuffleRowRDD, we will get an > error result > --- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! > Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO 
org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326242#comment-16326242 ] zhoukang edited comment on SPARK-23076 at 1/15/18 1:34 PM: --- As the picture i posted, i cached MapPartitionsRDD which depends on ShuffleRowRDD. We used this for our internal spark sql service [~cloud_fan] was (Author: cane): As the picture i posted, i cached MapPartitionsRDD which depends on ShuffleRowRDD. We used this our internal spark sql service [~cloud_fan] > When we call cache() on ShuffleRowRDD, we will get an error result > -- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! 
> Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326242#comment-16326242 ] zhoukang commented on SPARK-23076: -- As the picture i posted, i cached MapPartitionsRDD which depends on ShuffleRowRDD. We used this our internal spark sql service [~cloud_fan] > When we call cache() on ShuffleRowRDD, we will get an error result > -- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! > Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO 
org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23076: - Description: For query below: {code:java} select * from csv_demo limit 3; {code} The correct result should be: 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; ++++-- |_c0|_c1| ++++-- |Joe|20| |Tom|30| |Hyukjin|25| ++++-- However,when we call cache on MapPartitionsRDD below: !shufflerowrdd-cache.png! Then result will be error: 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; ++++-- |_c0|_c1| ++++-- |Hyukjin|25| |Hyukjin|25| |Hyukjin|25| ++++-- The reason why this happen is that: UnsafeRow which generated by ShuffleRowRDD#compute will use the same under byte buffer I print some log below to explain this: Modify UnsafeRow.toString() {code:java} // This is for debugging @Override public String toString() { StringBuilder build = new StringBuilder("["); for (int i = 0; i < sizeInBytes; i += 8) { if (i != 0) build.append(','); build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i))); } build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and baseOffset here return build.toString(); }{code} {code:java} 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] {code} we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD iterator when config is true,like below: {code:java} reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) && x._2.isInstanceOf[UnsafeRow]) { (x._2).asInstanceOf[UnsafeRow].copy() } else { x._2 } }) } 
{code} was: For query below: {code:java} select * from csv_demo limit 3; {code} The correct result should be: 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; +---+-++-- |_c0|_c1| +---+-++-- |Joe|20| |Tom|30| |Hyukjin|25| +---+-++-- However,when we call cache on ShuffleRowRDD(or RDD which depends on ShuffleRowRDD in one stage): !shufflerowrdd-cache.png! Then result will be error: 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; +---+-++-- |_c0|_c1| +---+-++-- |Hyukjin|25| |Hyukjin|25| |Hyukjin|25| +---+-++-- The reason why this happen is that: UnsafeRow which generated by ShuffleRowRDD#compute will use the same under byte buffer I print some log below to explain this: Modify UnsafeRow.toString() {code:java} // This is for debugging @Override public String toString() { StringBuilder build = new StringBuilder("["); for (int i = 0; i < sizeInBytes; i += 8) { if (i != 0) build.append(','); build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i))); } build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and baseOffset here return build.toString(); }{code} {code:java} 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] {code} we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD iterator when config is true,like below: {code:java} reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) && x._2.isInstanceOf[UnsafeRow]) { (x._2).asInstanceOf[UnsafeRow].copy() } else { x._2 } }) } {code} > When we call cache() on ShuffleRowRDD, we will get an error result > -- > > 
Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from
[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326195#comment-16326195 ] Fernando Pereira edited comment on SPARK-17998 at 1/15/18 12:35 PM:
-
[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{spark.sql.files.maxPartitionBytes}} worked
was (Author: ferdonline):
[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{park.sql.files.maxPartitionBytes}} worked
> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
> Priority: Major
>
> Reading a Parquet file into a DataFrame results in far too few in-memory partitions. In prior versions of Spark, the resulting DataFrame would have a number of partitions often equal to the number of parts in the Parquet folder.
> Here's a minimal reproducible sample:
> {code:python}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {code}
> The above shows only 7 partitions in the DataFrame that was created by reading the Parquet back into memory for me. Why is it no longer just the number of part files in the Parquet folder? (Which is 13 in the example above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with the underlying RDD without first repartitioning the DataFrame, which is costly and wasteful. I really doubt this was the intended effect of moving to Spark 2.0.
> I've tried to research where the number of in-memory partitions is determined, but my Scala skills have proven inadequate. I'd be happy to dig further if someone could point me in the right direction...
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326195#comment-16326195 ] Fernando Pereira commented on SPARK-17998:
--
[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{spark.sql.files.maxPartitionBytes}} worked
> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
> Priority: Major
>
> Reading a Parquet file into a DataFrame results in far too few in-memory partitions. In prior versions of Spark, the resulting DataFrame would have a number of partitions often equal to the number of parts in the Parquet folder.
> Here's a minimal reproducible sample:
> {code:python}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {code}
> The above shows only 7 partitions in the DataFrame that was created by reading the Parquet back into memory for me. Why is it no longer just the number of part files in the Parquet folder? (Which is 13 in the example above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with the underlying RDD without first repartitioning the DataFrame, which is costly and wasteful. I really doubt this was the intended effect of moving to Spark 2.0.
> I've tried to research where the number of in-memory partitions is determined, but my Scala skills have proven inadequate. I'd be happy to dig further if someone could point me in the right direction...
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
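Some context on why the part-file count no longer determines the partition count: since Spark 2.0, the file scan bin-packs file splits into read partitions based on {{spark.sql.files.maxPartitionBytes}}, {{spark.sql.files.openCostInBytes}}, and the default parallelism. The sketch below is a simplified Python model of that packing, not the authoritative logic (which lives in Spark's DataSourceScanExec and may differ in detail across versions; in particular this model never splits a single large file). Under these assumptions, with the default config values and defaultParallelism = 6, thirteen tiny part-files pack into 7 partitions, matching the number reported above.

```python
# Simplified model of how Spark 2.x packs Parquet part-files into read
# partitions. The constants mirror the defaults of
# spark.sql.files.maxPartitionBytes (128 MB) and
# spark.sql.files.openCostInBytes (4 MB).

def num_read_partitions(file_sizes, max_partition_bytes=128 * 1024 * 1024,
                        open_cost=4 * 1024 * 1024, default_parallelism=6):
    # Each file "costs" open_cost extra bytes, to discourage tiny partitions.
    total = sum(file_sizes) + open_cost * len(file_sizes)
    bytes_per_core = total / default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))
    partitions, current = 0, 0
    # Greedily pack files (largest first) until a partition would overflow.
    for size in sorted(file_sizes, reverse=True):
        padded = size + open_cost
        if current > 0 and current + padded > max_split:
            partitions += 1
            current = 0
        current += padded
    return partitions + (1 if current > 0 else 0)

# 13 tiny part-files with defaultParallelism = 6 pack two-per-partition:
print(num_read_partitions([1024] * 13))   # 7
```

Raising {{spark.sql.files.openCostInBytes}} or lowering {{spark.sql.files.maxPartitionBytes}} pushes this count back up toward the part-file count, which is consistent with the comment above that only {{spark.sql.files.maxPartitionBytes}} had an effect.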
[jira] [Commented] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326163#comment-16326163 ] Wenchen Fan commented on SPARK-23076:
-
How did you cache the ShuffledRowRDD? It's an intermediate, internal RDD created by Spark SQL; we don't expect end users to touch it.
> When we call cache() on ShuffleRowRDD, we will get an error result
> --
>
> Key: SPARK-23076
> URL: https://issues.apache.org/jira/browse/SPARK-23076
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: zhoukang
> Priority: Major
> Attachments: shufflerowrdd-cache.png
>
>
> For the query below:
> {code:java}
> select * from csv_demo limit 3;
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Joe      | 20   |
> | Tom      | 30   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> However, when we call cache() on the ShuffledRowRDD (or an RDD which depends on the ShuffledRowRDD in the same stage):
> !shufflerowrdd-cache.png!
> then the result is wrong:
> {code}
> 0: jdbc:hive2://xxx/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> The reason this happens is that the UnsafeRows generated by ShuffledRowRDD#compute reuse the same underlying byte buffer.
> I printed some logs to demonstrate this, after modifying UnsafeRow.toString():
> {code:java}
> // This is for debugging
> @Override
> public String toString() {
>   StringBuilder build = new StringBuilder("[");
>   for (int i = 0; i < sizeInBytes; i += 8) {
>     if (i != 0) build.append(',');
>     build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i)));
>   }
>   build.append("," + baseObject + "," + baseOffset + ']'); // Print baseObject and baseOffset here
>   return build.toString();
> }
> {code}
> {code:java}
> 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16]
> 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16]
> 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16]
> {code}
> We can fix this by adding a config, and copying each UnsafeRow read by the ShuffledRowRDD iterator when the config is true, like below:
> {code:java}
> reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => {
>   if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE)
>       && x._2.isInstanceOf[UnsafeRow]) {
>     x._2.asInstanceOf[UnsafeRow].copy()
>   } else {
>     x._2
>   }
> })
> {code}
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23077.
---
Resolution: Invalid
I'm not sure what you're reporting here, but this sounds like a question about your code. Start on the mailing list or StackOverflow. Jira is for clearly described problems or changes, and ideally, a proposed solution.
> Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
> --
>
> Key: SPARK-23077
> URL: https://issues.apache.org/jira/browse/SPARK-23077
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.2.0
> Reporter: Pravin Agrawal
> Priority: Minor
>
> Using Apache Spark 2.2 Structured Streaming, create a program which reads data from Kafka and writes it to Hive.
> Data arrives from a Kafka topic at 100 records/sec and should be written to a Hive table.
> *Hive table created:*
> {code:sql}
> CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC;
> {code}
> *Insert via manual Hive query:*
> {code:sql}
> INSERT INTO TABLE demo_user VALUES (1514133139123, 14, 26.4, 'pravin', true);
> {code}
> *Insert via Spark Structured Streaming code:*
> {code:java}
> SparkConf conf = new SparkConf();
> conf.setAppName("testing");
> conf.setMaster("local[2]");
> conf.set("hive.metastore.uris", "thrift://localhost:9083");
> SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate();
> // workaround START: code to insert static data into hive
> String insertQuery = "INSERT INTO TABLE demo_user VALUES (1514133139123, 14, 26.4, 'pravin', true)";
> session.sql(insertQuery);
> // workaround END
> // Solution START
> Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic
> // some code which writes dataset into hive table demo_user
> // Solution END
> {code}
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
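For reference, the pattern the reporter is after, draining a stream into periodic bulk writes, is what Structured Streaming's foreachBatch sink (added later, in Spark 2.4) provides: each trigger hands the sink a batch of rows that can be written with an ordinary batch writer. The sketch below is a plain-Python illustration of that micro-batch contract, not Spark code; `write_batch` is a hypothetical stand-in for a sink-specific writer such as a bulk INSERT into the ORC table above.

```python
# Toy micro-batch driver: group an unbounded record stream into fixed-size
# batches and hand each batch, with its batch id, to a sink writer. This
# mirrors the (batch DataFrame, batch id) contract of foreachBatch.
from itertools import islice

def for_each_batch(records, batch_size, write_batch):
    it = iter(records)
    batch_id = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        write_batch(batch_id, batch)   # e.g. one bulk INSERT per batch
        batch_id += 1

written = []
for_each_batch(range(10), batch_size=4,
               write_batch=lambda bid, rows: written.append((bid, rows)))
print(written)   # [(0, [0, 1, 2, 3]), (1, [4, 5, 6, 7]), (2, [8, 9])]
```

Batching the writes this way is also what keeps a 100-records/sec stream from producing one tiny ORC file per record.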
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Issue Type: Improvement (was: Bug) > Apache Structured Streaming: Bulk/Batch write support for Hive using > streaming dataset > -- > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, Create a program which reads > data from Kafka and write it to Hive. > Data incoming from Kafka topic @ 100 records/sec and write to Hive table. > **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326153#comment-16326153 ] Sean Owen commented on SPARK-23076:
---
That sounds serious, although if true, I would expect a lot of things to fail. Where are you calling cache(), by the way? Maybe CC [~cloud_fan]
> When we call cache() on ShuffleRowRDD, we will get an error result
> --
>
> Key: SPARK-23076
> URL: https://issues.apache.org/jira/browse/SPARK-23076
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: zhoukang
> Priority: Major
> Attachments: shufflerowrdd-cache.png
>
>
> For the query below:
> {code:java}
> select * from csv_demo limit 3;
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Joe      | 20   |
> | Tom      | 30   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> However, when we call cache() on the ShuffledRowRDD (or an RDD which depends on the ShuffledRowRDD in the same stage):
> !shufflerowrdd-cache.png!
> then the result is wrong:
> {code}
> 0: jdbc:hive2://xxx/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> The reason this happens is that the UnsafeRows generated by ShuffledRowRDD#compute reuse the same underlying byte buffer.
> I printed some logs to demonstrate this, after modifying UnsafeRow.toString():
> {code:java}
> // This is for debugging
> @Override
> public String toString() {
>   StringBuilder build = new StringBuilder("[");
>   for (int i = 0; i < sizeInBytes; i += 8) {
>     if (i != 0) build.append(',');
>     build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i)));
>   }
>   build.append("," + baseObject + "," + baseOffset + ']'); // Print baseObject and baseOffset here
>   return build.toString();
> }
> {code}
> {code:java}
> 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16]
> 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16]
> 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16]
> {code}
> We can fix this by adding a config, and copying each UnsafeRow read by the ShuffledRowRDD iterator when the config is true, like below:
> {code:java}
> reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => {
>   if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE)
>       && x._2.isInstanceOf[UnsafeRow]) {
>     x._2.asInstanceOf[UnsafeRow].copy()
>   } else {
>     x._2
>   }
> })
> {code}
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes
[ https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326151#comment-16326151 ] Nick Pentreath commented on SPARK-22943: Does the new estimator & model version of OHE solve this underlying issue? > OneHotEncoder supports manual specification of categorySizes > > > Key: SPARK-22943 > URL: https://issues.apache.org/jira/browse/SPARK-22943 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > OHE should support configurable categorySizes, as n-values in > http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. > which allows consistent and foreseeable conversion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
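As background for why manually specified category sizes give the "consistent and foreseeable conversion" the issue asks for: when the one-hot width is inferred from the data, two datasets that happen to contain different observed categories encode to incompatible vector lengths. The sketch below is a hypothetical plain-Python `one_hot`, not the Spark ML or scikit-learn API, illustrating the point.

```python
# one_hot with an optional fixed size: if size is given the output width is
# stable; otherwise it is inferred from the categories actually observed.
def one_hot(index, size=None, seen=None):
    width = size if size is not None else max(seen) + 1   # inferred width
    vec = [0] * width
    vec[index] = 1
    return vec

batch_a = [0, 1]       # categories observed in batch A
batch_b = [0, 1, 2]    # batch B additionally sees category 2

# Inferred widths differ between batches, so the encodings are incompatible:
print(len(one_hot(1, seen=batch_a)))   # 2
print(len(one_hot(1, seen=batch_b)))   # 3

# With a manually specified size (the proposal here), both batches agree:
print(one_hot(1, size=3))   # [0, 1, 0]
```

This is the same role that the `n_values` parameter played in scikit-learn's OneHotEncoder, cited in the issue description.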
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Description: Using Apache Spark 2.2: Structured Streaming, Create a program which reads data from Kafka and write it to Hive. Data incoming from Kafka topic @ 100 records/sec and write to Hive table. **Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // some code which writes dataset into hive table demo_user // Solution END was: Using Apache Spark 2.2: Structured Streaming, I am creating a program which reads data from Kafka and write it to Hive. I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. 
**Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // some code which writes dataset into hive table demo_user // Solution END > Apache Structured Streaming: Bulk/Batch write support for Hive using > streaming dataset > -- > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, Create a program which reads > data from Kafka and write it to Hive. > Data incoming from Kafka topic @ 100 records/sec and write to Hive table. 
> **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Summary: Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset (was: Apache Structured Streaming: Unable to write streaming dataset into Hive) > Apache Structured Streaming: Bulk/Batch write support for Hive using > streaming dataset > -- > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, I am creating a program which > reads data from Kafka and write it to Hive. > I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. > **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22993) checkpointInterval param doc should be clearer
[ https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22993. Resolution: Fixed > checkpointInterval param doc should be clearer > -- > > Key: SPARK-22993 > URL: https://issues.apache.org/jira/browse/SPARK-22993 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Trivial > > several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, > LDA, GBT), each of which silently ignores the parameter when the checkpoint > directory is not set on the spark context. This should be documented in the > param doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Description: Using Apache Spark 2.2: Structured Streaming, I am creating a program which reads data from Kafka and write it to Hive. I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. **Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // some code which writes dataset into hive table demo_user // Solution END was: Using Apache Spark 2.2: Structured Streaming, I am creating a program which reads data from Kafka and write it to Hive. I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. 
**Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // **My question here:** // some code which writes dataset into hive table demo_user // Solution END > Apache Structured Streaming: Unable to write streaming dataset into Hive > > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, I am creating a program which > reads data from Kafka and write it to Hive. > I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. 
> **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22993) checkpointInterval param doc should be clearer
[ https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Pentreath reassigned SPARK-22993:
--------------------------------------
    Assignee: Seth Hendrickson

> checkpointInterval param doc should be clearer
> ----------------------------------------------
>
>                 Key: SPARK-22993
>                 URL: https://issues.apache.org/jira/browse/SPARK-22993
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Seth Hendrickson
>            Assignee: Seth Hendrickson
>            Priority: Trivial
>
> Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS,
> LDA, GBT), and each of them silently ignores the parameter when no checkpoint
> directory is set on the Spark context. This should be documented in the
> param doc.
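To make the pitfall in SPARK-22993 concrete: setting `checkpointInterval` on ALS has no effect unless a checkpoint directory is also set on the SparkContext. A minimal sketch, assuming a local session and a hypothetical ratings DataFrame (the checkpoint path is an example):

```java
import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.sql.SparkSession;

public class CheckpointIntervalDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("checkpoint-demo")
                .master("local[2]")
                .getOrCreate();

        // Without this line, the checkpointInterval below is silently
        // ignored: ALS (like LDA and GBT) only checkpoints when a
        // checkpoint directory is set on the SparkContext.
        spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints");

        ALS als = new ALS()
                .setMaxIter(20)
                .setCheckpointInterval(5) // truncate lineage every 5 iterations
                .setUserCol("userId")
                .setItemCol("movieId")
                .setRatingCol("rating");

        // als.fit(ratings) would now checkpoint intermediate RDDs every
        // 5 iterations, keeping the lineage (and stack depth) bounded.
        spark.stop();
    }
}
```

Checkpointing here trades some I/O for a bounded RDD lineage, which matters for iterative algorithms that would otherwise build a 20-deep dependency chain.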
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive?
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pravin Agrawal updated SPARK-23077:
-----------------------------------
    Issue Type: Bug  (was: Question)
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pravin Agrawal updated SPARK-23077:
-----------------------------------
    Summary: Apache Structured Streaming: Unable to write streaming dataset into Hive  (was: Apache Structured Streaming: Unable to write streaming dataset into Hive?)
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive?
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pravin Agrawal updated SPARK-23077:
-----------------------------------
    Summary: Apache Structured Streaming: Unable to write streaming dataset into Hive?  (was: Apache Structured Streaming: How to write streaming dataset into Hive?)