[jira] [Updated] (SPARK-22739) Additional Expression Support for Objects
[ https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-22739: --- Priority: Major (was: Critical) > Additional Expression Support for Objects > - > > Key: SPARK-22739 > URL: https://issues.apache.org/jira/browse/SPARK-22739 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Aleksander Eskilson >Priority: Major > > Some discussion in Spark-Avro [1] motivates additions and minor changes to > the {{Objects}} Expressions API [2]. The proposed changes include > * a generalized form of {{initializeJavaBean}} taking a sequence of > initialization expressions that can be applied to instances of varying objects > * an object cast that performs a simple Java type cast against a value > * making {{ExternalMapToCatalyst}} public, for use in outside libraries > These changes would facilitate the writing of custom encoders for varying > objects that cannot already be readily converted to a statically typed > dataset by a JavaBean encoder (e.g. Avro). > [1] -- > https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110 > [2] -- > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
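The "generalized form of initializeJavaBean" idea can be illustrated with a plain-Scala toy sketch (the names `Record` and `initializeObject` are illustrative only — this is not the Catalyst `Expression` API the ticket targets): create one instance, then fold a sequence of initialization steps over it, which is the shape a custom encoder for e.g. Avro records would need.

```scala
// Toy illustration (plain Scala, NOT the actual Catalyst objects API) of a
// generalized initializer: one constructor thunk plus an arbitrary sequence
// of per-field initialization steps applied to the new instance.
class Record {
  var name: String = _
  var age: Int = _
}

def initializeObject[T](newInstance: () => T, setters: Seq[T => Unit]): T = {
  val obj = newInstance()
  setters.foreach(setter => setter(obj)) // each step initializes one field
  obj
}

val record = initializeObject(() => new Record, Seq[Record => Unit](
  r => r.name = "Joe",
  r => r.age = 20
))
```

Because the setters are just a sequence, the same driver works for objects with varying shapes, which is the generalization the ticket asks for.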
[jira] [Resolved] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal resolved SPARK-23020. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20223 [https://github.com/apache/spark/pull/20223] > Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > -- > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.3.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/
[jira] [Assigned] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal reassigned SPARK-23020: -- Assignee: Marcelo Vanzin > Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > -- > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/
[jira] [Commented] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326795#comment-16326795 ] Apache Spark commented on SPARK-23085: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/20275 > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
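A minimal sketch of what parity could look like (plain Scala stand-in, hypothetical — not the actual `Vectors` implementation): relax the size check from strictly positive to non-negative so the tuple-based overload accepts empty vectors, as the other three overloads already do.

```scala
// Hypothetical stand-in for MLLib.Vectors.sparse(size, elements): the fix
// sketched here is changing a `size > 0` requirement to `size >= 0`.
def sparse(size: Int, elements: Seq[(Int, Double)]): (Int, Array[Int], Array[Double]) = {
  require(size >= 0, s"The size of the requested sparse vector ($size) must be non-negative.")
  val sorted = elements.sortBy(_._1)          // indices must be ordered
  val indices = sorted.map(_._1).toArray
  val values  = sorted.map(_._2).toArray
  require(indices.forall(i => i >= 0 && i < size), "Index out of bounds.")
  (size, indices, values)
}

val empty    = sparse(0, Seq.empty)           // accepted under the relaxed check
val nonEmpty = sparse(3, Seq((2, 5.0), (0, 1.0)))
```

Under the original strict check, the `sparse(0, Seq.empty)` call would throw the `requirement failed` error shown in the REPL transcript above.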
[jira] [Assigned] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23085: Assignee: (was: Apache Spark) > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Assigned] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23085: Assignee: Apache Spark > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Updated] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-23085: - Description: Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. {code:java} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} was: Both {ML.Vectors#sparse size: Int, indices: Array[Int], values: Array[Double] } and {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse( size: Int, indices: Array[Int], values: Array[Double] )}} also supports it. However, {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} require a positve length. {code:java} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Updated] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
[ https://issues.apache.org/jira/browse/SPARK-23085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-23085: - Description: Both {ML.Vectors#sparse size: Int, indices: Array[Int], values: Array[Double] } and {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse( size: Int, indices: Array[Int], values: Array[Double] )}} also supports it. However, {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} require a positve length. {code:java} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} was: Both {{ML.Vectors#sparse size: Int, indices: Array[Int], values: Array[Double] }} and {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse( size: Int, indices: Array[Int], values: Array[Double] )}} also supports it. However, {{ ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)]) }} require a positve length. {code} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code} > API parity for mllib.linalg.Vectors.sparse > --- > > Key: SPARK-23085 > URL: https://issues.apache.org/jira/browse/SPARK-23085 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. > In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. > However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} require a positve length. > > {code:java} > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) > res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) > scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) > java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. > at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) > ... 50 elided > {code}
[jira] [Created] (SPARK-23085) API parity for mllib.linalg.Vectors.sparse
zhengruifeng created SPARK-23085: Summary: API parity for mllib.linalg.Vectors.sparse Key: SPARK-23085 URL: https://issues.apache.org/jira/browse/SPARK-23085 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.0 Reporter: zhengruifeng Both {{ML.Vectors#sparse(size: Int, indices: Array[Int], values: Array[Double])}} and {{ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])}} support zero-length vectors. In old MLLib, {{MLLib.Vectors.sparse(size: Int, indices: Array[Int], values: Array[Double])}} also supports it. However, {{MLLib.Vectors.sparse(size: Int, elements: Seq[(Int, Double)])}} requires a positive length. {code} scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res15: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.ml.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) res16: org.apache.spark.ml.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[Int], Array.empty[Double]) res17: org.apache.spark.mllib.linalg.Vector = (0,[],[]) scala> org.apache.spark.mllib.linalg.Vectors.sparse(0, Array.empty[(Int, Double)]) java.lang.IllegalArgumentException: requirement failed: The size of the requested sparse vector must be greater than 0. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.mllib.linalg.Vectors$.sparse(Vectors.scala:315) ... 50 elided {code}
[jira] [Updated] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22956: - Fix Version/s: (was: 2.3.0) 2.4.0 2.3.1 > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > When we union 2 streams from Kafka or other sources, and one of them has no continuous data coming in while a task restarts, this causes an `IllegalStateException`. This is mainly caused by the code in [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190]: when one stream has no continuous data, its committedOffset equals its availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` is not handled properly in KafkaSource. We should perhaps also consider this scenario in other Sources.
[jira] [Resolved] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22956. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20150 [https://github.com/apache/spark/pull/20150] > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.3.0 > > > When we union 2 streams from Kafka or other sources, and one of them has no continuous data coming in while a task restarts, this causes an `IllegalStateException`. This is mainly caused by the code in [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190]: when one stream has no continuous data, its committedOffset equals its availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` is not handled properly in KafkaSource. We should perhaps also consider this scenario in other Sources.
[jira] [Assigned] (SPARK-22956) Union Stream Failover Cause `IllegalStateException`
[ https://issues.apache.org/jira/browse/SPARK-22956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-22956: Assignee: Li Yuanjian > Union Stream Failover Cause `IllegalStateException` > --- > > Key: SPARK-22956 > URL: https://issues.apache.org/jira/browse/SPARK-22956 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > > When we union 2 streams from Kafka or other sources, and one of them has no continuous data coming in while a task restarts, this causes an `IllegalStateException`. This is mainly caused by the code in [MicroBatchExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L190]: when one stream has no continuous data, its committedOffset equals its availableOffset during `populateStartOffsets`, and `currentPartitionOffsets` is not handled properly in KafkaSource. We should perhaps also consider this scenario in other Sources.
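The failure mode above can be modeled with a toy sketch (plain Scala; `StreamOffset` and `pendingRange` are illustrative names, not Spark's internal classes): after a restart, a source that produced no new data has a committed offset equal to its available offset, and code that assumes a non-empty offset range breaks.

```scala
// Toy model of the restart scenario (NOT Spark internals): a source with no
// new data has committed == available after start-offset recovery.
final case class StreamOffset(value: Long)

// Returning None for an empty range, instead of assuming data exists,
// is the kind of handling the description says KafkaSource was missing.
def pendingRange(committed: StreamOffset, available: StreamOffset): Option[(StreamOffset, StreamOffset)] =
  if (available.value > committed.value) Some((committed, available)) else None

val idleSource   = pendingRange(StreamOffset(42), StreamOffset(42)) // nothing new
val activeSource = pendingRange(StreamOffset(42), StreamOffset(45)) // new data
```

In a union of two sources, only the active one should contribute a batch; treating the idle source's empty range as an error is what surfaces as the `IllegalStateException`.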
[jira] [Updated] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-19700: --- Issue Type: Improvement (was: Bug) > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah >Priority: Major > > One point that was brought up in discussing SPARK-18278 was that schedulers cannot easily be added to Spark without forking the whole project. The main reason is that much of the scheduler's behavior fundamentally depends on the CoarseGrainedSchedulerBackend class, which is not part of the public API of Spark and is in fact quite a complex module. As resource management and allocation continues to evolve, Spark will need to be integrated with more cluster managers, but maintaining support for all possible allocators in the Spark project would be untenable. Furthermore, it would be impossible for Spark to support proprietary frameworks that are developed by specific users for their own particular use cases. > Therefore, this ticket proposes making scheduler implementations fully pluggable. The idea is that Spark will provide a Java/Scala interface that is to be implemented by a scheduler that is backed by the cluster manager of interest. The user can compile their scheduler's code into a JAR that is placed on the driver's classpath. Finally, as is the case in the current world, the scheduler implementation is selected and dynamically loaded depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current CoarseGrainedSchedulerBackend handles many responsibilities, some of which will be common across all cluster managers, and some of which will be specific to a particular cluster manager. For example, the particular mechanism for creating the executor processes will differ between YARN and Mesos, but, once these executors have started running, the means to submit tasks to them over the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the application, because different cluster managers support different configuration options, and thus the driver must be bootstrapped accordingly. For example, in YARN mode the application and Hadoop configuration must be packaged and shipped to the distributed cache prior to launching the job. A prototype of a Kubernetes implementation starts a Kubernetes pod that runs the driver in cluster mode.
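The shape of such a pluggable API might look like the following sketch (the trait and method names `PluggableScheduler`, `canHandle`, etc. are hypothetical, not an actual Spark interface): an implementation declares which master URLs it handles, and the driver selects one by URL, mirroring the dynamic loading the ticket describes.

```scala
// Hypothetical pluggable-scheduler interface (illustrative names only).
trait PluggableScheduler {
  def canHandle(masterUrl: String): Boolean // e.g. "k8s://...", "yarn"
  def start(): Unit                         // bootstrap cluster-specific state
  def stop(): Unit
}

// Selection by master URL, as the ticket proposes for dynamic loading.
def select(schedulers: Seq[PluggableScheduler], masterUrl: String): Option[PluggableScheduler] =
  schedulers.find(_.canHandle(masterUrl))

// A toy implementation standing in for a Kubernetes-backed scheduler.
class KubernetesLikeScheduler extends PluggableScheduler {
  def canHandle(masterUrl: String): Boolean = masterUrl.startsWith("k8s://")
  def start(): Unit = ()
  def stop(): Unit = ()
}

val chosen  = select(Seq(new KubernetesLikeScheduler), "k8s://https://host:6443")
val noMatch = select(Seq(new KubernetesLikeScheduler), "yarn")
```

In the proposed design the implementations would come from user JARs on the driver's classpath rather than a hard-coded `Seq`; the URL-based dispatch is the part sketched here.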
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326711#comment-16326711 ] Anirudh Ramanathan commented on SPARK-23083: opened https://github.com/apache/spark-website/pull/87 > Adding Kubernetes as an option to https://spark.apache.org/ > --- > > Key: SPARK-23083 > URL: https://issues.apache.org/jira/browse/SPARK-23083 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > [https://spark.apache.org/] can now include a reference to Kubernetes, along with the k8s logo. I think this is not tied to the docs. > cc/ [~rxin] [~sameer]
[jira] [Resolved] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23080. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20271 [https://github.com/apache/spark/pull/20271] > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.3.0 > > > When a user puts the wrong number of parameters in a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is instead a built-in one, no information about the expected and actual number of parameters is provided. Providing it can help debug errors in some cases (e.g. bad quote escaping may lead to a different number of parameters than expected).
[jira] [Assigned] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23080: Assignee: Marco Gaido > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.3.0 > > > When a user puts the wrong number of parameters in a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is instead a built-in one, no information about the expected and actual number of parameters is provided. Providing it can help debug errors in some cases (e.g. bad quote escaping may lead to a different number of parameters than expected).
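A sketch of the kind of message this change enables (hypothetical helper, not the actual Spark implementation): report both the argument counts a built-in function accepts and the count actually supplied, as was already done for UDFs.

```scala
// Hypothetical arity-mismatch message builder (illustrative only).
def invalidArityMessage(function: String, expected: Seq[Int], actual: Int): String =
  s"Invalid number of arguments for function $function. " +
    s"Expected: ${expected.mkString(" or ")}; Found: $actual"

// Example: a function accepting 2 or 3 arguments called with 4.
val msg = invalidArityMessage("substring", Seq(2, 3), 4)
```

With a message like this, a bad-quote-escaping mistake that silently changes the argument count becomes immediately visible in the exception text.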
[jira] [Commented] (SPARK-22624) Expose range partitioning shuffle introduced by SPARK-22614
[ https://issues.apache.org/jira/browse/SPARK-22624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326707#comment-16326707 ] xubo245 commented on SPARK-22624: - [~smilegator] ok, I will finish it. > Expose range partitioning shuffle introduced by SPARK-22614 > --- > > Key: SPARK-22624 > URL: https://issues.apache.org/jira/browse/SPARK-22624 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu >Priority: Major >
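For context, what a range partitioning shuffle does can be sketched in plain Python. This is an illustrative toy under simplifying assumptions, not Spark's implementation: split points are derived from a sample of the keys, and each record goes to the partition whose key range contains it, so partitions are non-overlapping and globally ordered.

```python
# Toy range partitioner: sample-derived boundaries + binary search.
# range_boundaries and range_partition are made-up names for illustration.
import bisect

def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sorted key sample."""
    s = sorted(sample)
    step = len(s) / num_partitions
    return [s[int(step * i)] for i in range(1, num_partitions)]

def range_partition(records, boundaries):
    """Route each record to the partition whose range covers its key."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        parts[bisect.bisect_right(boundaries, r)].append(r)
    return parts
```

Because partition ranges never overlap, concatenating the sorted partitions yields a globally sorted dataset, which is why this shuffle is worth exposing directly.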
[jira] [Commented] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326685#comment-16326685 ] zhoukang commented on SPARK-23076: -- Yes, maybe it should be an improvement, since I suppose others may have the same requirement. The original reason we cache the MapPartitionsRDD is that we estimate the result size in SparkExecuteStatementOperation#execute, so we cache the result RDD first. We estimate the result size to avoid a Thrift server OOM error. [~cloud_fan] > When we call cache() on RDD which depends on ShuffleRowRDD, we will get an > error result > --- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For the query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > +----------+------+ > |_c0|_c1| > +----------+------+ > |Joe|20| > |Tom|30| > |Hyukjin|25| > +----------+------+ > However, when we call cache on the MapPartitionsRDD below: > !shufflerowrdd-cache.png!
> Then the result will be wrong: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > +----------+------+ > |_c0|_c1| > +----------+------+ > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > +----------+------+ > The reason this happens is that the UnsafeRows generated by > ShuffleRowRDD#compute share the same underlying byte buffer. > I printed some logs below to explain this, after modifying UnsafeRow.toString(): > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > We can fix this by adding a config and copying each UnsafeRow read by the ShuffleRowRDD > iterator when the config is true, like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > {code}
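The buffer-reuse bug described above can be reproduced in miniature without Spark. In this plain-Python sketch (the `ReusedRow` class is made up for illustration), an iterator that keeps mutating one shared row object stands in for ShuffleRowRDD's UnsafeRows sharing one byte buffer; the remedy is the same defensive `copy()` the proposed patch applies.

```python
# One shared mutable row object models UnsafeRows backed by one byte
# buffer. Caching references preserves only the last value; copying each
# row as it is read preserves them all.

class ReusedRow:
    def __init__(self):
        self.values = None

    def set(self, values):
        self.values = values
        return self

    def copy(self):
        fresh = ReusedRow()
        fresh.values = list(self.values)
        return fresh

def read_rows(data):
    row = ReusedRow()          # single shared object, like one byte buffer
    for values in data:
        yield row.set(values)  # same object, new contents on every step

data = [["Joe", 20], ["Tom", 30], ["Hyukjin", 25]]

wrong = list(read_rows(data))                # three references to one row
right = [r.copy() for r in read_rows(data)]  # defensive copy, as in the fix
```

After iteration, every element of `wrong` shows the last row read, which mirrors the three identical "Hyukjin" rows in the report, while `right` keeps all three distinct rows.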
[jira] [Updated] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23076: - Issue Type: Improvement (was: Bug)
[jira] [Commented] (SPARK-20120) spark-sql CLI support silent mode
[ https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326673#comment-16326673 ] Apache Spark commented on SPARK-20120: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20274 > spark-sql CLI support silent mode > - > > Key: SPARK-20120 > URL: https://issues.apache.org/jira/browse/SPARK-20120 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.2.0 > > > It is similar to Hive's silent mode: only the query result is shown. See: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throws TempTableAlreadyExistsException and outputs "Temporary table '$table' already exists" when we create a temp view using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} No need to fix: warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There is a warning when running test("rename temporary view - destination table with database name"): 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead Other test cases also have this warning. was: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } No need to fix: warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view{code} There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } No need to fix: warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view{code} There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning was: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... 
is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > > Problem: it throw TempTableAlreadyExistsException and output "Temporary table > '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > No need to fix: > warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view{code} > There are warning when run test: test("rename temporary view - destination > table with database name") > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > other test cases also have this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning was: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... 
is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > > Problem: it throw TempTableAlreadyExistsException and output "Temporary table > '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} > *No need to fix: > * warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > other test cases also have this warning -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Description: Problem: it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. {code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} *No need to fix: * warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead other test cases also have this warning was: Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view There are warning when run test: test("rename temporary view - destination table with database name") {code:java} 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead {code} other test cases also have this warning Another problem, it throw TempTableAlreadyExistsException and output "Temporary table '$table' already exists" when we create temp view by using org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's improper. 
{code:java} /** * Creates a global temp view, or issue an exception if the view already exists and * `overrideIfExists` is false. */ def create( name: String, viewDefinition: LogicalPlan, overrideIfExists: Boolean): Unit = synchronized { if (!overrideIfExists && viewDefinitions.contains(name)) { throw new TempTableAlreadyExistsException(name) } viewDefinitions.put(name, viewDefinition) } {code} > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > > Problem: it throw TempTableAlreadyExistsException and output "Temporary table > '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} > *No need to fix: * > warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... 
instead > other test cases also have this warning
[jira] [Updated] (SPARK-23035) Fix improper information of TempTableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23035: Summary: Fix improper information of TempTableAlreadyExistsException (was: Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view) > Fix improper information of TempTableAlreadyExistsException > --- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > {code:java} > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > {code} > other test cases also have this warning > Another problem, it throw TempTableAlreadyExistsException and output > "Temporary table '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
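A minimal sketch of the message fix this ticket tracks, in plain Python with made-up names (mirroring the shape of the `create()` method quoted in the description, not Spark's actual classes): when a temporary *view* already exists, the exception should say "view" rather than "Temporary table".

```python
# Made-up exception and registry for illustration; the point is only
# that the error message names a view, not a table.

class TempViewAlreadyExistsError(Exception):
    def __init__(self, name):
        super().__init__(f"Temporary view '{name}' already exists")

view_definitions = {}

def create(name, definition, override_if_exists=False):
    """Register a view, or fail with a view-specific message."""
    if not override_if_exists and name in view_definitions:
        raise TempViewAlreadyExistsError(name)
    view_definitions[name] = definition
```

The behavioral contract is unchanged; only the wording of the error improves, which is all the ticket asks for.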
[jira] [Commented] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326663#comment-16326663 ] Apache Spark commented on SPARK-23000: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/20273 > Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3 > - > > Key: SPARK-23000 > URL: https://issues.apache.org/jira/browse/SPARK-23000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/ > The DataSourceWithHiveMetastoreCatalogSuite test suite on branch-2.3 always > fails with Hadoop 2.6
[jira] [Updated] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23084: Description: Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) to PySpark. Also update the rangeBetween API {noformat} /** * Window function: returns the special frame boundary that represents the first row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedPreceding(): Column = Column(UnboundedPreceding) /** * Window function: returns the special frame boundary that represents the last row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedFollowing(): Column = Column(UnboundedFollowing) /** * Window function: returns the special frame boundary that represents the current row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def currentRow(): Column = Column(CurrentRow) {noformat} was: Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) to PySpark {noformat} /** * Window function: returns the special frame boundary that represents the first row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedPreceding(): Column = Column(UnboundedPreceding) /** * Window function: returns the special frame boundary that represents the last row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedFollowing(): Column = Column(UnboundedFollowing) /** * Window function: returns the special frame boundary that represents the current row in the * window partition. 
* * @group window_funcs * @since 2.3.0 */ def currentRow(): Column = Column(CurrentRow) {noformat} > Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark > --- > > Key: SPARK-23084 > URL: https://issues.apache.org/jira/browse/SPARK-23084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) > to PySpark. Also update the rangeBetween API > {noformat} > /** > * Window function: returns the special frame boundary that represents the > first row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedPreceding(): Column = Column(UnboundedPreceding) > /** > * Window function: returns the special frame boundary that represents the > last row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedFollowing(): Column = Column(UnboundedFollowing) > /** > * Window function: returns the special frame boundary that represents the > current row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def currentRow(): Column = Column(CurrentRow) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
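To illustrate what the new boundary helpers denote, here is a plain-Python sketch of a rows-between window frame (an illustrative model, not the PySpark API): with `unboundedPreceding()` as the frame start and `currentRow()` as the end, each row's aggregate covers everything from the start of the partition through the row itself, i.e. a running total.

```python
# Sentinel values model the special frame boundaries; frame_sum is a
# made-up helper that aggregates over a rows-between frame.

UNBOUNDED_PRECEDING = float("-inf")
UNBOUNDED_FOLLOWING = float("inf")
CURRENT_ROW = 0

def frame_sum(values, start, end):
    """Sum each row's frame; start/end are offsets relative to the row."""
    out = []
    for i in range(len(values)):
        lo = 0 if start == UNBOUNDED_PRECEDING else max(0, i + int(start))
        hi = (len(values) - 1 if end == UNBOUNDED_FOLLOWING
              else min(len(values) - 1, i + int(end)))
        out.append(sum(values[lo:hi + 1]))
    return out
```

Replacing the sentinel with a concrete offset (e.g. start of -1) shrinks the frame to a sliding window, which is why named boundaries and plain integers share one API.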
[jira] [Updated] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23084: Description: Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) to PySpark {noformat} /** * Window function: returns the special frame boundary that represents the first row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedPreceding(): Column = Column(UnboundedPreceding) /** * Window function: returns the special frame boundary that represents the last row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def unboundedFollowing(): Column = Column(UnboundedFollowing) /** * Window function: returns the special frame boundary that represents the current row in the * window partition. * * @group window_funcs * @since 2.3.0 */ def currentRow(): Column = Column(CurrentRow) {noformat} > Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark > --- > > Key: SPARK-23084 > URL: https://issues.apache.org/jira/browse/SPARK-23084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 > Environment: Add the new APIs (introduced by > https://github.com/apache/spark/pull/18814) to PySpark > {noformat} > /** > * Window function: returns the special frame boundary that represents the > first row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedPreceding(): Column = Column(UnboundedPreceding) > /** > * Window function: returns the special frame boundary that represents the > last row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedFollowing(): Column = Column(UnboundedFollowing) > /** > * Window function: returns the special frame boundary that represents the > current row in the > * window partition. 
> * > * @group window_funcs > * @since 2.3.0 > */ > def currentRow(): Column = Column(CurrentRow) > {noformat} >Reporter: Xiao Li >Priority: Major > > Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) > to PySpark > {noformat} > /** > * Window function: returns the special frame boundary that represents the > first row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedPreceding(): Column = Column(UnboundedPreceding) > /** > * Window function: returns the special frame boundary that represents the > last row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def unboundedFollowing(): Column = Column(UnboundedFollowing) > /** > * Window function: returns the special frame boundary that represents the > current row in the > * window partition. > * > * @group window_funcs > * @since 2.3.0 > */ > def currentRow(): Column = Column(CurrentRow) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23084: Environment: (was: the same text as the Description above, which had been pasted into the Environment field by mistake)
> Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
> ---
>
> Key: SPARK-23084
> URL: https://issues.apache.org/jira/browse/SPARK-23084
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Priority: Major
[jira] [Created] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark
Xiao Li created SPARK-23084: --- Summary: Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark Key: SPARK-23084 URL: https://issues.apache.org/jira/browse/SPARK-23084 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0 Environment: (the issue description quoted above, pasted into this field by mistake) Reporter: Xiao Li
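The semantics the three boundaries describe can be shown without any Spark dependency. Below is a pure-Python sketch (the helper `framed_sums` and its string-valued boundary arguments are illustrative only, not part of the proposed PySpark API): for each row of an ordered partition, aggregate over the frame running from `start` to `end`.

```python
# Illustrative only: window frame boundaries as plain list slicing, no Spark involved.
def framed_sums(partition, start, end):
    """Sum over the frame [start, end] for every row of an ordered partition.

    start: "unboundedPreceding" or "currentRow"
    end:   "currentRow" or "unboundedFollowing"
    """
    n = len(partition)
    out = []
    for i in range(n):
        lo = 0 if start == "unboundedPreceding" else i       # frame start for row i
        hi = n - 1 if end == "unboundedFollowing" else i     # frame end for row i
        out.append(sum(partition[lo:hi + 1]))
    return out

vals = [1, 2, 3, 4]
print(framed_sums(vals, "unboundedPreceding", "currentRow"))          # [1, 3, 6, 10]
print(framed_sums(vals, "currentRow", "unboundedFollowing"))          # [10, 9, 7, 4]
print(framed_sums(vals, "unboundedPreceding", "unboundedFollowing"))  # [10, 10, 10, 10]
```

In Spark itself these boundaries are specified per window (e.g. `Window.orderBy(...).rowsBetween(start, end)`); the ticket proposes first-class Column-returning functions for naming them.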
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326635#comment-16326635 ] Sean Owen commented on SPARK-23083: --- Yes that's fine. If it only makes sense after the 2.3 release then we'll wait to merge the PR. > Adding Kubernetes as an option to https://spark.apache.org/ > --- > > Key: SPARK-23083 > URL: https://issues.apache.org/jira/browse/SPARK-23083 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > [https://spark.apache.org/] can now include a reference to, and the k8s logo. > I think this is not tied to the docs. > cc/ [~rxin] [~sameer] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326631#comment-16326631 ] Anirudh Ramanathan commented on SPARK-23083: Thanks. I'll create a PR against that repo. Since it's separate, it can be merged and updated right around the time the release goes out?
[jira] [Commented] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
[ https://issues.apache.org/jira/browse/SPARK-23083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326619#comment-16326619 ] Reynold Xin commented on SPARK-23083: - Here's the website repo: [https://github.com/apache/spark-website]
[jira] [Created] (SPARK-23083) Adding Kubernetes as an option to https://spark.apache.org/
Anirudh Ramanathan created SPARK-23083: -- Summary: Adding Kubernetes as an option to https://spark.apache.org/ Key: SPARK-23083 URL: https://issues.apache.org/jira/browse/SPARK-23083 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 2.3.0 Reporter: Anirudh Ramanathan [https://spark.apache.org/] can now include a reference to Kubernetes, along with the k8s logo. I think this is not tied to the docs. cc/ [~rxin] [~sameer]
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326521#comment-16326521 ] Apache Spark commented on SPARK-23078: -- User 'ozzieba' has created a pull request for this issue: https://github.com/apache/spark/pull/20272 > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23078: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23078: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326489#comment-16326489 ] Ruslan Dautkhanov commented on SPARK-23074: --- Yep, there are use cases where ordering is provided, like reading files. We run production jobs that require going back to the RDD API just to do zipWithIndex, which is not as straightforward as it could be if there were a DataFrame-level API for zipWithIndex(), and not as performant. That SO answer got almost 30 upvotes in 2 years, so I know we're not alone and it could benefit many others. Thanks.
> Dataframe-ified zipwithindex
> ---
>
> Key: SPARK-23074
> URL: https://issues.apache.org/jira/browse/SPARK-23074
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Ruslan Dautkhanov
> Priority: Minor
> Labels: dataframe, rdd
>
> Would be great to have a dataframe-friendly equivalent of rdd.zipWithIndex():
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types.{LongType, StructField, StructType}
> import org.apache.spark.sql.Row
>
> def dfZipWithIndex(
>     df: DataFrame,
>     offset: Int = 1,
>     colName: String = "id",
>     inFront: Boolean = true
> ): DataFrame = {
>   df.sqlContext.createDataFrame(
>     df.rdd.zipWithIndex.map(ln =>
>       Row.fromSeq(
>         (if (inFront) Seq(ln._2 + offset) else Seq())
>           ++ ln._1.toSeq ++
>           (if (inFront) Seq() else Seq(ln._2 + offset))
>       )
>     ),
>     StructType(
>       (if (inFront) Array(StructField(colName, LongType, false)) else Array[StructField]())
>         ++ df.schema.fields ++
>         (if (inFront) Array[StructField]() else Array(StructField(colName, LongType, false)))
>     )
>   )
> }
> {code}
> credits: [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex]
[jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326479#comment-16326479 ] Sean Owen commented on SPARK-23074: --- Hm, rowNumber requires you to sort the input? I didn't think it did, semantically. The numbering isn't so meaningful unless the input has a defined ordering, sure, but the same is true of an RDD. Unless you sort it, the indexing could change when it's evaluated again. You're not really guaranteed what order you see the data, although in practice, like in your example, you will get data from things like files in the order you expect.
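The behaviour dfZipWithIndex is after can be mirrored in plain Python, which also makes the ordering caveat concrete: the index simply records encounter order. The helper below (`zip_with_index`, hypothetical, not a Spark API) reproduces the `offset`/`inFront` options:

```python
# Hypothetical pure-Python mirror of dfZipWithIndex: number rows in encounter
# order and attach the index in front of or behind the existing columns.
def zip_with_index(rows, offset=1, in_front=True):
    return [
        (i + offset,) + tuple(row) if in_front else tuple(row) + (i + offset,)
        for i, row in enumerate(rows)
    ]

rows = [("a", 10), ("b", 20), ("c", 30)]
print(zip_with_index(rows))                  # [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30)]
print(zip_with_index(rows, in_front=False))  # [('a', 10, 1), ('b', 20, 2), ('c', 30, 3)]
```

As with RDD.zipWithIndex, the numbering is only as meaningful as the order in which the rows arrive; reshuffle `rows` and the indices reshuffle with them, which is the point raised in the comment above.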
[jira] [Updated] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oz Ben-Ami updated SPARK-23082: --- Description: In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark driver to a different set of nodes from its executors. In Kubernetes, we can specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use separate options for the driver and executors. This would be useful for the particular use case where executors can go on more ephemeral nodes (e.g., with cluster autoscaling, or preemptible/spot instances), but the driver should use a more persistent machine. The required change would be minimal, essentially just using different config keys for the [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] and [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both. (was: the same text with plain formatting and bare links)
> Allow separate node selectors for driver and executors in Kubernetes
> ---
>
> Key: SPARK-23082
> URL: https://issues.apache.org/jira/browse/SPARK-23082
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Spark Submit
> Affects Versions: 2.2.0, 2.3.0
> Reporter: Oz Ben-Ami
> Priority: Minor
[jira] [Updated] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oz Ben-Ami updated SPARK-23082: --- Description: (reflowed; same text as the creation entry below)
> Allow separate node selectors for driver and executors in Kubernetes
> ---
>
> Key: SPARK-23082
> URL: https://issues.apache.org/jira/browse/SPARK-23082
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes, Spark Submit
> Affects Versions: 2.2.0, 2.3.0
> Reporter: Oz Ben-Ami
> Priority: Minor
[jira] [Created] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
Oz Ben-Ami created SPARK-23082: -- Summary: Allow separate node selectors for driver and executors in Kubernetes Key: SPARK-23082 URL: https://issues.apache.org/jira/browse/SPARK-23082 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Submit Affects Versions: 2.2.0, 2.3.0 Reporter: Oz Ben-Ami In YARN, we can use spark.yarn.am.nodeLabelExpression to submit the Spark driver to a different set of nodes from its executors. In Kubernetes, we can specify spark.kubernetes.node.selector.[labelKey], but we can't use separate options for the driver and executors. This would be useful for the particular use case where executors can go on more ephemeral nodes (e.g., with cluster autoscaling, or preemptible/spot instances), but the driver should use a more persistent machine. The required change would be minimal, essentially just using different config keys in [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] and [https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] instead of KUBERNETES_NODE_SELECTOR_PREFIX for both.
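A minimal sketch of the lookup rule such a change might use, in plain Python. The role-specific prefixes (`spark.kubernetes.driver.node.selector.*`, `spark.kubernetes.executor.node.selector.*`) and the fall-back-to-shared-prefix behaviour are assumptions for illustration, not settled API:

```python
# Assumed key shapes for illustration; only the shared prefix exists today.
SHARED_PREFIX = "spark.kubernetes.node.selector."

def node_selectors(conf, role):
    """Resolve node-selector labels for role ('driver' or 'executor').

    Role-specific keys win when any are present; otherwise fall back to the
    shared prefix. Merging instead of replacing would be an alternative design.
    """
    role_prefix = "spark.kubernetes.%s.node.selector." % role  # hypothetical key shape
    prefix = role_prefix if any(k.startswith(role_prefix) for k in conf) else SHARED_PREFIX
    return {k[len(prefix):]: v for k, v in conf.items() if k.startswith(prefix)}

conf = {
    "spark.kubernetes.driver.node.selector.pool": "stable",
    "spark.kubernetes.node.selector.zone": "us-east1-b",
}
print(node_selectors(conf, "driver"))    # {'pool': 'stable'}
print(node_selectors(conf, "executor"))  # {'zone': 'us-east1-b'}
```

With a config like the above, the driver would land on the persistent "stable" pool while executors keep the shared (ephemeral-friendly) selector.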
[jira] [Resolved] (SPARK-21108) convert LinearSVC to aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21108. Resolution: Fixed > convert LinearSVC to aggregator framework > - > > Key: SPARK-21108 > URL: https://issues.apache.org/jira/browse/SPARK-21108 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326459#comment-16326459 ] Bryan Cutler commented on SPARK-12717: -- Hi [~codlife], you can use Spark 2.2.1, which was released in December, or the upcoming 2.3.0 release; both have this fix.
> pyspark broadcast fails when using multiple threads
> ---
>
> Key: SPARK-12717
> URL: https://issues.apache.org/jira/browse/SPARK-12717
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.0
> Environment: Linux, python 2.6 or python 2.7.
> Reporter: Edward Walker
> Assignee: Bryan Cutler
> Priority: Critical
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
> Attachments: run.log
>
> The following multi-threaded program that uses broadcast variables consistently throws exceptions like: *Exception("Broadcast variable '18' not loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
>
> from optparse import OptionParser
>
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
>
> def do_process(x, w):
>     return x * w.value
>
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
>
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
>     threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism)]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output:
>
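For contrast, the fan-out pattern in the repro above can be stripped of Spark entirely: each thread combines shared read-only data with its own per-thread value. This is only a sketch of the access pattern, not the actual fix; it shows that the pattern itself is sound, and that the reported "Broadcast variable not loaded" failures came from Spark's broadcast handling rather than from using threads this way.

```python
# Plain-Python analogue of the multi-threaded repro: no SparkContext, no
# broadcast variables, just threads sharing read-only data.
import threading

data = list(range(1000))
results = {}  # each thread writes its own key, so no lock is needed

def func(name, shared, conf_value):
    total = sum(x * conf_value for x in shared)
    results[name] = total / float(len(shared))

threads = [
    threading.Thread(target=func, args=("thread_%d" % i, data, i))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items()))
# [('thread_0', 0.0), ('thread_1', 499.5), ('thread_2', 999.0), ('thread_3', 1498.5)]
```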
[jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326455#comment-16326455 ] Ruslan Dautkhanov commented on SPARK-23074: --- {quote}You can create a DataFrame from the result of .zipWithIndex on an RDD, as you see here.{quote} That's very kludgy, as you can see from the code snippet above. Direct functionality for this RDD-level API would be super awesome to have. {quote}There's already a rowNumber function in Spark SQL, however, which sounds like the native equivalent?{quote} That's not the same. rowNumber requires an expression to `order by` on. What if there is no such column? For example, we often have files that we ingest into Spark where the physical position of a record determines how that record has to be processed. I can give an exact example if necessary. zipWithIndex() is actually the only API call that preserves such information from the original source (even though it can be distributed into multiple partitions etc.). Also, as folks in that stackoverflow question said, the rowNumber approach is way slower (and that's not surprising, as it requires data sorting).
[jira] [Comment Edited] (SPARK-23074) Dataframe-ified zipwithindex
[ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326455#comment-16326455 ] Ruslan Dautkhanov edited comment on SPARK-23074 at 1/15/18 5:40 PM: {quote}You can create a DataFrame from the result of .zipWithIndex on an RDD, as you see here. {quote} That's very kludgy as you can see the code snippet above. A direct functionality for this RDD-level api would be super awesome to have. {quote}There's already a rowNumber function in Spark SQL, however, which sounds like the native equivalent? {quote} That's not the same. rowNumber requires an expression to `order by` on. What if there is no such column? For example, we often have files that we ingest into Spark and where we physical position of a record is meaningful how that record has to be processed. I can give exact example if you're interested. zipWithIndex() actually the only one API call that preserves such information from original source (even though it can be distributed into multiple partitions etc.). Also as folks in that stackoverflow question said, rowNumber approach is way slower (and it's not surprizing as it requires data sorting). was (Author: tagar): {quote}You can create a DataFrame from the result of .zipWithIndex on an RDD, as you see here.{quote} That's very kludgy as you can see the code snippet above. A direct functionality for this RDD-level api would be super awesome to have. {quote}There's already a rowNumber function in Spark SQL, however, which sounds like the native equivalent?{quote} That's not the same. rowNumber requires an expression to `order by` on. What if there is no such column? For example, we often have files that we ingest into Spark and where we physical position of a record is meaningful how that record has to be processed. I can give exact example if you're necessary. 
zipWithIndex() is actually the only API call that preserves such information from the original source (even though it can be distributed into multiple partitions etc.). Also, as folks in that stackoverflow question said, the rowNumber approach is way slower (and that's not surprising, as it requires sorting the data). > Dataframe-ified zipwithindex > > > Key: SPARK-23074 > URL: https://issues.apache.org/jira/browse/SPARK-23074 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Ruslan Dautkhanov >Priority: Minor > Labels: dataframe, rdd > > Would be great to have a dataframe-friendly equivalent of rdd.zipWithIndex(): > {code:java} > import org.apache.spark.sql.DataFrame > import org.apache.spark.sql.types.{LongType, StructField, StructType} > import org.apache.spark.sql.Row > def dfZipWithIndex( > df: DataFrame, > offset: Int = 1, > colName: String = "id", > inFront: Boolean = true > ) : DataFrame = { > df.sqlContext.createDataFrame( > df.rdd.zipWithIndex.map(ln => > Row.fromSeq( > (if (inFront) Seq(ln._2 + offset) else Seq()) > ++ ln._1.toSeq ++ > (if (inFront) Seq() else Seq(ln._2 + offset)) > ) > ), > StructType( > (if (inFront) Array(StructField(colName,LongType,false)) else > Array[StructField]()) > ++ df.schema.fields ++ > (if (inFront) Array[StructField]() else > Array(StructField(colName,LongType,false))) > ) > ) > } > {code} > credits: > [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
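For readers who want the shape of the transformation without spinning up Spark, here is a plain-Python stand-in (illustrative only, not Spark code) for what {{dfZipWithIndex}} does to each row: pair every row with its global position and prepend or append that position, plus an offset, as a new column.

```python
# Plain-Python sketch of the dfZipWithIndex semantics (not Spark code).
# zipWithIndex pairs each row with its global position; the helper then
# either prepends or appends that position (plus an offset) as a column.
def df_zip_with_index(rows, offset=1, in_front=True):
    """rows: list of tuples standing in for DataFrame rows."""
    out = []
    for idx, row in enumerate(rows):
        if in_front:
            out.append((idx + offset,) + tuple(row))
        else:
            out.append(tuple(row) + (idx + offset,))
    return out

rows = [("a",), ("b",), ("c",)]
print(df_zip_with_index(rows))  # → [(1, 'a'), (2, 'b'), (3, 'c')]
```

Unlike rowNumber, nothing here needs an order-by column: the index simply reflects the original record positions.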
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326422#comment-16326422 ] Roque Vassal'lo commented on SPARK-6305: Hi, I would like to know the current status of this ticket. Log4j 1 reached EOL two years ago, and the current log4j 2 version (2.10) is compatible with log4j 1 properties files (which, as I have read in this thread, is a must). I have run some tests using log4j's bridge and slf4j, and it seems to work fine and does not need the properties files to be modified. Would this be a good approach to keep compatibility while getting new log4j 2 functionality and bug fixes? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23081) Add colRegex API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23081: Summary: Add colRegex API to PySpark (was: Add colRegex to PySpark) > Add colRegex API to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23081) Add colRegex to PySpark
Xiao Li created SPARK-23081: --- Summary: Add colRegex to PySpark Key: SPARK-23081 URL: https://issues.apache.org/jira/browse/SPARK-23081 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0 Reporter: Xiao Li Fix For: 2.3.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23081) Add colRegex to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23081: Fix Version/s: (was: 2.3.0) > Add colRegex to PySpark > --- > > Key: SPARK-23081 > URL: https://issues.apache.org/jira/browse/SPARK-23081 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-12139: --- Assignee: jane > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Assignee: jane >Priority: Minor > Fix For: 2.3.0 > > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive would interpret this query as a regular expression, which can be > supported in the hive parser for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326390#comment-16326390 ] Oz Ben-Ami commented on SPARK-23078: [~mgaido] no objections, Kubernetes is the one that most needs it anyway. > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-12139. - Resolution: Fixed Fix Version/s: 2.3.0 > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Priority: Minor > Fix For: 2.3.0 > > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive would interpret this query as a regular expression, which can be > supported in the hive parser for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
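For illustration, the regex column specification this resolves behaves roughly like the following plain-Python sketch: a backquoted pattern in the select list is matched against the table's column names. Using {{re.fullmatch}} here is an assumption made for the sketch, not a claim about Hive's or Spark's exact matching rules.

```python
import re

def select_regex_columns(columns, pattern):
    """Return the column names that fully match a backquoted regex,
    mimicking Hive's SELECT `pattern` FROM t column specification."""
    rx = re.compile(pattern)
    # Preserve the table's column order, as a SELECT * would.
    return [c for c in columns if rx.fullmatch(c)]

print(select_regex_columns(["key", "k1", "value"], "k.*"))  # → ['key', 'k1']
```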
[jira] [Assigned] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23080: Assignee: (was: Apache Spark) > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > When a user puts the wrong number of parameters in a function, an > AnalysisException is thrown. If the function is a UDF, the user is told how > many parameters the function expected and how many were supplied. If the > function is instead a built-in one, no information about the expected and > actual number of parameters is provided. Providing it would help in some > cases to debug errors (e.g. bad quote escaping may lead to a different > number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23080: Assignee: Apache Spark > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Trivial > > When a user puts the wrong number of parameters in a function, an > AnalysisException is thrown. If the function is a UDF, the user is told how > many parameters the function expected and how many were supplied. If the > function is instead a built-in one, no information about the expected and > actual number of parameters is provided. Providing it would help in some > cases to debug errors (e.g. bad quote escaping may lead to a different > number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23080) Improve error message for built-in functions
[ https://issues.apache.org/jira/browse/SPARK-23080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326383#comment-16326383 ] Apache Spark commented on SPARK-23080: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20271 > Improve error message for built-in functions > > > Key: SPARK-23080 > URL: https://issues.apache.org/jira/browse/SPARK-23080 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > When a user puts the wrong number of parameters in a function, an > AnalysisException is thrown. If the function is a UDF, the user is told how > many parameters the function expected and how many were supplied. If the > function is instead a built-in one, no information about the expected and > actual number of parameters is provided. Providing it would help in some > cases to debug errors (e.g. bad quote escaping may lead to a different > number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
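The improvement proposed in SPARK-23080 amounts to reporting expected vs. actual arity for built-in functions the way UDF errors already do. A minimal plain-Python sketch of that idea follows; the registry, the chosen function name, and the exact message wording are hypothetical, not Spark's actual classes or messages.

```python
class AnalysisException(Exception):
    pass

# Hypothetical registry mapping a built-in function name to its expected
# arity; a stand-in for the real function registry's signature metadata.
BUILTINS = {"substring": 3}

def check_arity(name, args):
    """Raise an AnalysisException that names both the expected and
    the actual parameter count, as the improvement proposes."""
    expected = BUILTINS.get(name)
    if expected is not None and len(args) != expected:
        raise AnalysisException(
            f"Invalid number of arguments for function {name}: "
            f"expected {expected}, found {len(args)}")

check_arity("substring", ["s", 1, 2])  # correct arity: no error raised
```

With a message like this, the bad-quote-escaping case from the description becomes immediately visible as a mismatched argument count.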
[jira] [Commented] (SPARK-22624) Expose range partitioning shuffle introduced by SPARK-22614
[ https://issues.apache.org/jira/browse/SPARK-22624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326382#comment-16326382 ] Xiao Li commented on SPARK-22624: - cc [~xubo245] Do you want to take this? > Expose range partitioning shuffle introduced by SPARK-22614 > --- > > Key: SPARK-22624 > URL: https://issues.apache.org/jira/browse/SPARK-22624 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326378#comment-16326378 ] Marco Gaido commented on SPARK-23078: - [~ozzieba] I see that in Kubernetes it might work, but I think it makes no sense on any other RM. The best option would therefore be to allow it only on Kubernetes. > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326366#comment-16326366 ] Oz Ben-Ami edited comment on SPARK-23078 at 1/15/18 4:02 PM: - [~mgaido] In Kubernetes you can just create a Service (think in-cluster DNS+load balancing) which automatically connects to it, whichever node it's on was (Author: ozzieba): [~mgaido] In Kubernetes you can just create a Service which automatically connects to it, whichever node it's on > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326366#comment-16326366 ] Oz Ben-Ami commented on SPARK-23078: [~mgaido] In Kubernetes you can just create a Service which automatically connects to it, whichever node it's on > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23080) Improve error message for built-in functions
Marco Gaido created SPARK-23080: --- Summary: Improve error message for built-in functions Key: SPARK-23080 URL: https://issues.apache.org/jira/browse/SPARK-23080 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Marco Gaido When a user puts the wrong number of parameters in a function, an AnalysisException is thrown. If the function is a UDF, the user is told how many parameters the function expected and how many were supplied. If the function is instead a built-in one, no information about the expected and actual number of parameters is provided. Providing it would help in some cases to debug errors (e.g. bad quote escaping may lead to a different number of parameters than expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.
[ https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326330#comment-16326330 ] Steve Loughran edited comment on SPARK-23050 at 1/15/18 3:24 PM: - Quick review of the code Yes, there's potentially a failure if a cached 404 is picked up in taskCommit. It'd be slightly less brittle to return the array of URIs in the task commit message and have {{commitJob}} call getFileStatus() for each. That'd eliminate the problem except for any task committed immediately before commitJob & whose ref was still in the negative cache of the S3 load balancers. It would also help catch the potential issue "file is lost between task commit and job commit". Even so, it'd be safer to do a little retry, a bit like ScalaTest's {{eventually()}}, to deal with that negative caching. It *should* only last a few seconds at worst (we don't have any real figures on it; it's so rarely seen, at least with the s3a client). Following the commitJob code path, {{HDFSMetadataLog}} could be made object-store aware and opt for a direct (atomic) overwrite of the log, rather than the write to temp & rename. Without that, time to commit becomes O(files) rather than O(1). Oh, and the issue is the map in the commit job {code:java} val statuses: Seq[SinkFileStatus] = addedFiles.map(f => SinkFileStatus(fs.getFileStatus(new Path(f {code} Probing S3 for a single file can take a few hundred millis, as it's a single HEAD for a file, [more for a directory|https://steveloughran.blogspot.co.uk/2016/12/how-long-does-filesystemexists-take.html]. And the more queries you make, the higher the risk of S3 throttling. Doing that in jobCommit amplifies the risk of throttling. More efficient would be to do a single list of the task working dir into a map, then use that to look up the files: approximately O(1) per directory, though as LIST consistency lags create consistency, it's actually less reliable. 
Finally, I believe ORC may not write a file if there's nothing to write, so the file not being there isn't necessarily a failure case (implication: any busy-wait for a file coming into existence should not wait that long) +[~fabbri], [~Thomas Demoor], [~ehiggs] was (Author: ste...@apache.org): Quick review of the code Yes, there's potentially a failure if a cached 404 is picked up in taskCommit. It'd be slightly less brittle to return the array of URIs in the task commit message and have {{commitJob}} call getFileStatus() for each. That'd eliminate the problem except for any task committed immediately before commitJob & whose ref was still in the negative cache of the S3 load balancers. It would also help catch the potential issue "file is lost between task commit and job commit". Even so, it'd be safer to do a little retry, a bit like ScalaTest's {{eventually()}}, to deal with that negative caching. It *should* only last a few seconds at worst (we don't have any real figures on it; it's so rarely seen, at least with the s3a client). Following the commitJob code path, {{HDFSMetadataLog}} could be made object-store aware and opt for a direct (atomic) overwrite of the log, rather than the write to temp & rename. Without that, time to commit becomes O(files) rather than O(1). > Structured Streaming with S3 file source duplicates data because of eventual > consistency. > - > > Key: SPARK-23050 > URL: https://issues.apache.org/jira/browse/SPARK-23050 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yash Sharma >Priority: Major > > Spark Structured streaming with S3 file source duplicates data because of > eventual consistency. > Reproducing the scenario - > - Structured streaming reading from S3 source. Writing back to S3. > - Spark tries to commitTask on completion of a task, by verifying if all the > files have been written to Filesystem. > {{ManifestFileCommitProtocol.commitTask}}. 
> - [Eventual consistency issue] Spark finds that the file is not present and > fails the task. {{org.apache.spark.SparkException: Task failed while writing > rows. No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}} > - By this time S3 eventually gets the file. > - Spark reruns the task and completes the task, but gets a new file name this > time. {{ManifestFileCommitProtocol.newTaskTempFile. > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}} > - Data duplicates in results and the same data is processed twice and written > to S3. > - There is no data duplication if spark is able to list presence of all > committed files and all tasks succeed. >
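The {{eventually()}}-style retry suggested in the comment above can be sketched in plain Python: retry a probe until it stops raising or a deadline passes, riding out a short negative-caching window such as a cached S3 404. The timeout and interval values below are illustrative, not recommendations.

```python
import time

def eventually(probe, timeout=10.0, interval=0.5):
    """Retry probe() until it stops raising or the timeout expires,
    mimicking ScalaTest's eventually() for a transient negative cache."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return probe()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # the window never closed; surface the real error
            time.sleep(interval)

# Usage sketch: a stand-in for a getFileStatus() call that briefly 404s.
attempts = {"n": 0}
def probe():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FileNotFoundError("cached 404")
    return "FileStatus(...)"

print(eventually(probe, timeout=5, interval=0.01))
```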
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326337#comment-16326337 ] Marco Gaido commented on SPARK-23078: - The problem is: in cluster mode you don't control where the thriftserver is launched. So how do you connect to it? I am not sure it makes sense to run it in cluster mode. > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23079) Fix query constraints propagation with aliases
[ https://issues.apache.org/jira/browse/SPARK-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23079: Assignee: (was: Apache Spark) > Fix query constraints propagation with aliases > -- > > Key: SPARK-23079 > URL: https://issues.apache.org/jira/browse/SPARK-23079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Major > > Previously, PR #19201 fixed the problem of non-converging constraints. > After that, PR #19149 improved the loop so that constraints are inferred > only once, and the problem of non-converging constraints is gone. > Also, in the current code, the case below fails. > ``` > spark.range(5).write.saveAsTable("t") > val t = spark.read.table("t") > val left = t.withColumn("xid", $"id" + lit(1)).as("x") > val right = t.withColumnRenamed("id", "xid").as("y") > val df = left.join(right, "xid").filter("id = 3").toDF() > checkAnswer(df, Row(4, 3)) > ``` > This is because `aliasMap` replaces all the aliased children. See the test > case in the PR for details. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23079) Fix query constraints propagation with aliases
[ https://issues.apache.org/jira/browse/SPARK-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326334#comment-16326334 ] Apache Spark commented on SPARK-23079: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/20270 > Fix query constraints propagation with aliases > -- > > Key: SPARK-23079 > URL: https://issues.apache.org/jira/browse/SPARK-23079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Major > > Previously, PR #19201 fixed the problem of non-converging constraints. > After that, PR #19149 improved the loop so that constraints are inferred > only once, and the problem of non-converging constraints is gone. > Also, in the current code, the case below fails. > ``` > spark.range(5).write.saveAsTable("t") > val t = spark.read.table("t") > val left = t.withColumn("xid", $"id" + lit(1)).as("x") > val right = t.withColumnRenamed("id", "xid").as("y") > val df = left.join(right, "xid").filter("id = 3").toDF() > checkAnswer(df, Row(4, 3)) > ``` > This is because `aliasMap` replaces all the aliased children. See the test > case in the PR for details. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23079) Fix query constraints propagation with aliases
[ https://issues.apache.org/jira/browse/SPARK-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23079: Assignee: Apache Spark > Fix query constraints propagation with aliases > -- > > Key: SPARK-23079 > URL: https://issues.apache.org/jira/browse/SPARK-23079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Previously, PR #19201 fixed the problem of non-converging constraints. > After that, PR #19149 improved the loop so that constraints are inferred > only once, and the problem of non-converging constraints is gone. > Also, in the current code, the case below fails. > ``` > spark.range(5).write.saveAsTable("t") > val t = spark.read.table("t") > val left = t.withColumn("xid", $"id" + lit(1)).as("x") > val right = t.withColumnRenamed("id", "xid").as("y") > val df = left.join(right, "xid").filter("id = 3").toDF() > checkAnswer(df, Row(4, 3)) > ``` > This is because `aliasMap` replaces all the aliased children. See the test > case in the PR for details. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23035. - Resolution: Fixed Assignee: xubo245 Fix Version/s: 2.3.0 > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > -- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 >Priority: Major > Fix For: 2.3.0 > > > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warnings when running the test: test("rename temporary view - destination > table with database name") > {code:java} > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > {code} > Other test cases also produce this warning. > Another problem: it throws TempTableAlreadyExistsException and outputs > "Temporary table '$table' already exists" when we create a temp view using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
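The quoted {{create}} logic boils down to "fail if the view already exists, unless an override is requested" — and the ticket's point is that the failure should use a view-specific exception and message. A plain-Python stand-in for that semantics (class and method names mirror the Scala but are illustrative, not Spark's API):

```python
class TempViewAlreadyExistsException(Exception):
    """View-specific exception, per the ticket's requested wording."""
    def __init__(self, name):
        super().__init__(f"Temporary view '{name}' already exists")

class GlobalTempViewManager:
    def __init__(self):
        self._views = {}

    def create(self, name, view_definition, override_if_exists=False):
        # Mirrors the Scala: refuse to overwrite unless explicitly allowed.
        if not override_if_exists and name in self._views:
            raise TempViewAlreadyExistsException(name)
        self._views[name] = view_definition

m = GlobalTempViewManager()
m.create("v", "plan1")
m.create("v", "plan2", override_if_exists=True)  # ok: override allowed
```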
[jira] [Commented] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.
[ https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326330#comment-16326330 ] Steve Loughran commented on SPARK-23050: Quick review of the code Yes, there's potentially a failure if a cached 404 is picked up in taskCommit. It'd be slightly less brittle to return the array of URIs in the task commit message and have {{commitJob}} call getFileStatus() for each. That'd eliminate the problem except for any task committed immediately before commitJob & whose ref was still in the negative cache of the S3 load balancers. It would also help catch the potential issue "file is lost between task commit and job commit". Even so, it'd be safer to do a little retry, a bit like ScalaTest's {{eventually()}}, to deal with that negative caching. It *should* only last a few seconds at worst (we don't have any real figures on it; it's so rarely seen, at least with the s3a client). Following the commitJob code path, {{HDFSMetadataLog}} could be made object-store aware and opt for a direct (atomic) overwrite of the log, rather than the write to temp & rename. Without that, time to commit becomes O(files) rather than O(1). > Structured Streaming with S3 file source duplicates data because of eventual > consistency. > - > > Key: SPARK-23050 > URL: https://issues.apache.org/jira/browse/SPARK-23050 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yash Sharma >Priority: Major > > Spark Structured streaming with S3 file source duplicates data because of > eventual consistency. > Reproducing the scenario - > - Structured streaming reading from S3 source. Writing back to S3. > - Spark tries to commitTask on completion of a task, by verifying if all the > files have been written to Filesystem. > {{ManifestFileCommitProtocol.commitTask}}. > - [Eventual consistency issue] Spark finds that the file is not present and > fails the task. 
{{org.apache.spark.SparkException: Task failed while writing > rows. No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}} > - By this time S3 eventually gets the file. > - Spark reruns the task and completes the task, but gets a new file name this > time. {{ManifestFileCommitProtocol.newTaskTempFile. > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}} > - Data duplicates in results and the same data is processed twice and written > to S3. > - There is no data duplication if spark is able to list presence of all > committed files and all tasks succeed. > Code: > {code} > query = selected_df.writeStream \ > .format("parquet") \ > .option("compression", "snappy") \ > .option("path", "s3://path/data/") \ > .option("checkpointLocation", "s3://path/checkpoint/") \ > .start() > {code} > Same sized duplicate S3 Files: > {code} > $ aws s3 ls s3://path/data/ | grep part-00256 > 2018-01-11 03:37:00 17070 > part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet > 2018-01-11 03:37:10 17070 > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet > {code} > Exception on S3 listing and task failure: > {code} > [Stage 5:>(277 + 100) / > 597]18/01/11 03:36:59 WARN TaskSetManager: Lost task 256.0 in stage 5.0 (TID > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816) > at >
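The retry idea from the comment above, polling an existence probe until S3's negative (404) cache entry expires in the spirit of ScalaTest's {{eventually()}}, can be sketched in plain Python. This is only an illustrative model: the store class and function names are hypothetical stand-ins, no real S3 or Hadoop FileSystem calls are made, and `exists` plays the role of a getFileStatus() probe.

```python
import time

def wait_until_visible(exists, path, timeout_s=10.0, interval_s=0.5):
    """Poll an existence check until it succeeds or the deadline passes.

    An eventually()-style retry for riding out negative (404) caching;
    `exists` stands in for a getFileStatus() probe against the store.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

class EventuallyConsistentStore:
    """Simulated store: an object only becomes visible after a few
    probes, like a cached 404 entry expiring on a load balancer."""
    def __init__(self, visible_after_probes):
        self.probes = 0
        self.visible_after = visible_after_probes

    def exists(self, path):
        self.probes += 1
        return self.probes > self.visible_after

store = EventuallyConsistentStore(visible_after_probes=3)
found = wait_until_visible(store.exists, "s3://path/data/part-00256.parquet",
                           timeout_s=5.0, interval_s=0.01)
print(found)  # True: the probe succeeds once the simulated 404 cache expires
```

In a real commitJob, the loop would run once per committed file and rethrow the FileNotFoundException if the deadline passes, so a genuinely lost file still fails the job.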
[jira] [Created] (SPARK-23079) Fix query constraints propagation with aliases
Gengliang Wang created SPARK-23079: -- Summary: Fix query constraints propagation with aliases Key: SPARK-23079 URL: https://issues.apache.org/jira/browse/SPARK-23079 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Gengliang Wang Previously, PR #19201 fixed the problem of non-converging constraints. After that, PR #19149 improved the loop so that constraints are inferred only once, so the problem of non-converging constraints is gone. However, in the current code, the case below will still fail. ``` spark.range(5).write.saveAsTable("t") val t = spark.read.table("t") val left = t.withColumn("xid", $"id" + lit(1)).as("x") val right = t.withColumnRenamed("id", "xid").as("y") val df = left.join(right, "xid").filter("id = 3").toDF() checkAnswer(df, Row(4, 3)) ``` This is because `aliasMap` replaces all the aliased children. See the test case in the PR for details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326304#comment-16326304 ] Wenchen Fan commented on SPARK-23076: - To be clear, you maintain an internal Spark version and cache the MapPartitionsRDD by changing the Spark SQL code base? Then this is not a bug... > When we call cache() on RDD which depends on ShuffleRowRDD, we will get an > error result > --- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! 
> Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
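The root cause described in the report above, a single mutable UnsafeRow backing buffer reused across the iterator, can be modeled in a few lines of plain Python. This is a hedged sketch, not Spark code: the dict stands in for the reused byte buffer, and `dict(row)` plays the role of the proposed `UnsafeRow.copy()` before caching.

```python
# A minimal model of the UnsafeRow reuse problem: the iterator yields
# the *same* mutable row object each time, rewriting it in place, much
# as ShuffleRowRDD#compute reuses one underlying byte buffer.
def reused_row_iterator(values):
    row = {}                      # stands in for the single UnsafeRow buffer
    for name, age in values:
        row["_c0"], row["_c1"] = name, age
        yield row                 # same object every time; only contents change

data = [("Joe", 20), ("Tom", 30), ("Hyukjin", 25)]

# Caching the yielded references keeps three pointers to one object, so
# every cached "row" shows the last value written: the duplicated
# |Hyukjin|25| rows from the bug report.
broken_cache = list(reused_row_iterator(data))
print(broken_cache)               # all three entries show Hyukjin/25

# The proposed fix: copy each row before it is cached.
fixed_cache = [dict(row) for row in reused_row_iterator(data)]
print(fixed_cache)                # three distinct rows, as expected
```

This is why the suggested patch copies each UnsafeRow read from the shuffle iterator before handing it to a cached RDD.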
[jira] [Created] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
Oz Ben-Ami created SPARK-23078: -- Summary: Allow Submitting Spark Thrift Server in Cluster Mode Key: SPARK-23078 URL: https://issues.apache.org/jira/browse/SPARK-23078 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Submit, SQL Affects Versions: 2.2.1, 2.2.0, 2.3.0 Reporter: Oz Ben-Ami Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running in Cluster mode, since at the time it was not able to do so successfully. I have confirmed that Spark Thrift Server can run in Cluster mode on Kubernetes, by commenting out [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] I have not had a chance to test against YARN. Since Kubernetes does not have Client mode, this change is necessary to run the Spark Thrift Server in Kubernetes. [~foxish] [~coderanger]
[jira] [Resolved] (SPARK-23070) Bump previousSparkVersion in MimaBuild.scala to be 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-23070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23070. - Resolution: Fixed > Bump previousSparkVersion in MimaBuild.scala to be 2.2.0 > > > Key: SPARK-23070 > URL: https://issues.apache.org/jira/browse/SPARK-23070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker >
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326303#comment-16326303 ] JP Moresmau commented on SPARK-18112: - I'm using Hive 2.3.2 and Spark 2.2.1, but I still run into this issue. Is there any specific configuration setting I should look for? > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at 
org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22995) Spark UI stdout/stderr links point to executors internal address
[ https://issues.apache.org/jira/browse/SPARK-22995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jhon Cardenas resolved SPARK-22995. --- Resolution: Unresolved > Spark UI stdout/stderr links point to executors internal address > > > Key: SPARK-22995 > URL: https://issues.apache.org/jira/browse/SPARK-22995 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.1 > Environment: AWS EMR, yarn cluster. >Reporter: Jhon Cardenas >Priority: Major > Attachments: link.jpeg > > > On the Spark UI, in the Environment and Executors tabs, the stdout and stderr links point to the internal addresses of the executors. This implies exposing the executors so that the links can be accessed. Shouldn't those links point to the master instead, with the master internally serving as a proxy for these files, rather than exposing the internal machines?
[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23029: Assignee: (was: Apache Spark) > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Priority: Minor > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
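The heap blow-up is consistent with the ticket title: a suffix-less spark.shuffle.file.buffer value is read as KiB, so `$((32*1024))` asks for 32 MiB per buffered stream rather than the intended 32768 bytes. A rough Python model follows; the parser is a simplified, hypothetical stand-in for Spark's real byte-string parsing, and the reduce-partition count is an assumed example, not a figure from the report.

```python
# Simplified stand-in for a byte-string parser where a value with no
# unit suffix defaults to KiB (the behavior this ticket documents).
UNITS = {"": 1024, "b": 1, "k": 1024, "kb": 1024,
         "m": 1024**2, "mb": 1024**2, "g": 1024**3, "gb": 1024**3}

def buffer_size_bytes(setting: str) -> int:
    s = setting.strip().lower()
    digits = s.rstrip("bkmg")          # split trailing unit letters off
    suffix = s[len(digits):]
    return int(digits) * UNITS[suffix]  # suffix-less values default to KiB

assumed = buffer_size_bytes("32768")    # the user meant 32768 bytes...
print(assumed)                          # 33554432, i.e. 32 MiB per buffer

# BypassMergeSortShuffleWriter opens one buffered writer per reduce
# partition; assuming e.g. 200 reduce partitions in one task:
reduce_partitions = 200
print(assumed * reduce_partitions / 2**30)  # 6.25 (GiB of buffers -> OOM)

# The intended spelling with an explicit unit:
print(buffer_size_bytes("32k"))         # 32768
```

Documenting the default unit (or writing `32k` explicitly) avoids the 1024x over-allocation.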
[jira] [Commented] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326260#comment-16326260 ] Apache Spark commented on SPARK-23029: -- User 'ferdonline' has created a pull request for this issue: https://github.com/apache/spark/pull/20269 > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Priority: Minor > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23029: Assignee: Apache Spark > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Apache Spark >Priority: Minor > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: Chunsheng Ji > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Chunsheng Ji >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: (was: Weichen Xu) > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: Weichen Xu > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Resolved] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21856. Resolution: Fixed > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > the Python API also needs updating.
[jira] [Updated] (SPARK-23076) When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23076: - Summary: When we call cache() on RDD which depends on ShuffleRowRDD, we will get an error result (was: When we call cache() on ShuffleRowRDD, we will get an error result) > When we call cache() on RDD which depends on ShuffleRowRDD, we will get an > error result > --- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! > Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO 
org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326242#comment-16326242 ] zhoukang edited comment on SPARK-23076 at 1/15/18 1:34 PM: --- As the picture i posted, i cached MapPartitionsRDD which depends on ShuffleRowRDD. We used this for our internal spark sql service [~cloud_fan] was (Author: cane): As the picture i posted, i cached MapPartitionsRDD which depends on ShuffleRowRDD. We used this our internal spark sql service [~cloud_fan] > When we call cache() on ShuffleRowRDD, we will get an error result > -- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! 
> Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326242#comment-16326242 ] zhoukang commented on SPARK-23076: -- As the picture i posted, i cached MapPartitionsRDD which depends on ShuffleRowRDD. We used this our internal spark sql service [~cloud_fan] > When we call cache() on ShuffleRowRDD, we will get an error result > -- > > Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Joe|20| > |Tom|30| > |Hyukjin|25| > ++++-- > However,when we call cache on MapPartitionsRDD below: > !shufflerowrdd-cache.png! > Then result will be error: > 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; > ++++-- > |_c0|_c1| > ++++-- > |Hyukjin|25| > |Hyukjin|25| > |Hyukjin|25| > ++++-- > The reason why this happen is that: > UnsafeRow which generated by ShuffleRowRDD#compute will use the same under > byte buffer > I print some log below to explain this: > Modify UnsafeRow.toString() > {code:java} > // This is for debugging > @Override > public String toString() { > StringBuilder build = new StringBuilder("["); > for (int i = 0; i < sizeInBytes; i += 8) { > if (i != 0) build.append(','); > build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, > baseOffset + i))); > } > build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and > baseOffset here > return build.toString(); > }{code} > {code:java} > 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] > 2018-01-12,22:08:47,445 INFO 
org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] > 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: > Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] > {code} > we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD > iterator when config is true,like below: > {code:java} > reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { > if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) > && x._2.isInstanceOf[UnsafeRow]) { > (x._2).asInstanceOf[UnsafeRow].copy() > } else { > x._2 > } > }) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23076: - Description: For query below: {code:java} select * from csv_demo limit 3; {code} The correct result should be: 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; ++++-- |_c0|_c1| ++++-- |Joe|20| |Tom|30| |Hyukjin|25| ++++-- However,when we call cache on MapPartitionsRDD below: !shufflerowrdd-cache.png! Then result will be error: 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; ++++-- |_c0|_c1| ++++-- |Hyukjin|25| |Hyukjin|25| |Hyukjin|25| ++++-- The reason why this happen is that: UnsafeRow which generated by ShuffleRowRDD#compute will use the same under byte buffer I print some log below to explain this: Modify UnsafeRow.toString() {code:java} // This is for debugging @Override public String toString() { StringBuilder build = new StringBuilder("["); for (int i = 0; i < sizeInBytes; i += 8) { if (i != 0) build.append(','); build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i))); } build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and baseOffset here return build.toString(); }{code} {code:java} 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] {code} we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD iterator when config is true,like below: {code:java} reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) && x._2.isInstanceOf[UnsafeRow]) { (x._2).asInstanceOf[UnsafeRow].copy() } else { x._2 } }) } 
{code} was: For query below: {code:java} select * from csv_demo limit 3; {code} The correct result should be: 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3; +---+-++-- |_c0|_c1| +---+-++-- |Joe|20| |Tom|30| |Hyukjin|25| +---+-++-- However,when we call cache on ShuffleRowRDD(or RDD which depends on ShuffleRowRDD in one stage): !shufflerowrdd-cache.png! Then result will be error: 0: jdbc:hive2://xxx/> select * from csv_demo limit 3; +---+-++-- |_c0|_c1| +---+-++-- |Hyukjin|25| |Hyukjin|25| |Hyukjin|25| +---+-++-- The reason why this happen is that: UnsafeRow which generated by ShuffleRowRDD#compute will use the same under byte buffer I print some log below to explain this: Modify UnsafeRow.toString() {code:java} // This is for debugging @Override public String toString() { StringBuilder build = new StringBuilder("["); for (int i = 0; i < sizeInBytes; i += 8) { if (i != 0) build.append(','); build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i))); } build.append(","+baseObject+","+baseOffset+']'); // Print baseObject and baseOffset here return build.toString(); }{code} {code:java} 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16] 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16] 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16] {code} we can fix this by add a config,and copy UnsafeRow read by ShuffleRowRDD iterator when config is true,like below: {code:java} reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => { if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE) && x._2.isInstanceOf[UnsafeRow]) { (x._2).asInstanceOf[UnsafeRow].copy() } else { x._2 } }) } {code} > When we call cache() on ShuffleRowRDD, we will get an error result > -- > > 
Key: SPARK-23076 > URL: https://issues.apache.org/jira/browse/SPARK-23076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Major > Attachments: shufflerowrdd-cache.png > > > For query below: > {code:java} > select * from csv_demo limit 3; > {code} > The correct result should be: > 0: jdbc:hive2://10.108.230.228:1/> select * from
[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326195#comment-16326195 ] Fernando Pereira edited comment on SPARK-17998 at 1/15/18 12:35 PM:
-
[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{spark.sql.files.maxPartitionBytes}} worked
was (Author: ferdonline):
[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{park.sql.files.maxPartitionBytes}} worked
> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
> Priority: Major
>
> Reading a Parquet file into a DataFrame results in far too few in-memory partitions. In prior versions of Spark, the resulting DataFrame would have a number of partitions often equal to the number of parts in the Parquet folder.
> Here's a minimal reproducible sample:
> {code:python}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {code}
> The above shows only 7 partitions in the DataFrame that was created by reading the Parquet back into memory for me. Why is it no longer just the number of part files in the Parquet folder? (Which is 13 in the example above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with the underlying RDD without first repartitioning the DataFrame, which is costly and wasteful. I really doubt this was the intended effect of moving to Spark 2.0.
> I've tried to research where the number of in-memory partitions is determined, but my Scala skills have proven inadequate. I'd be happy to dig further if someone could point me in the right direction...
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326195#comment-16326195 ] Fernando Pereira commented on SPARK-17998:
--
[~sams] Did you have the chance to check with/without SQL to confirm this? I don't know where you got it from, but for me only {{spark.sql.files.maxPartitionBytes}} worked
> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
> Reporter: Shea Parkes
> Priority: Major
>
> Reading a Parquet file into a DataFrame results in far too few in-memory partitions. In prior versions of Spark, the resulting DataFrame would have a number of partitions often equal to the number of parts in the Parquet folder.
> Here's a minimal reproducible sample:
> {code:python}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {code}
> The above shows only 7 partitions in the DataFrame that was created by reading the Parquet back into memory for me. Why is it no longer just the number of part files in the Parquet folder? (Which is 13 in the example above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with the underlying RDD without first repartitioning the DataFrame, which is costly and wasteful. I really doubt this was the intended effect of moving to Spark 2.0.
> I've tried to research where the number of in-memory partitions is determined, but my Scala skills have proven inadequate. I'd be happy to dig further if someone could point me in the right direction...
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
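Some context on why the part-file count no longer determines the partition count: since Spark 2.0, the file scan bin-packs file splits into read partitions based on {{spark.sql.files.maxPartitionBytes}}, {{spark.sql.files.openCostInBytes}}, and the default parallelism. The sketch below is a simplified Python model of that packing, not the authoritative logic (which lives in Spark's DataSourceScanExec and may differ in detail across versions; in particular this model never splits a single large file). Under these assumptions, with the default config values and defaultParallelism = 6, thirteen tiny part-files pack into 7 partitions, matching the number reported above.

```python
# Simplified model of how Spark 2.x packs Parquet part-files into read
# partitions. The constants mirror the defaults of
# spark.sql.files.maxPartitionBytes (128 MB) and
# spark.sql.files.openCostInBytes (4 MB).

def num_read_partitions(file_sizes, max_partition_bytes=128 * 1024 * 1024,
                        open_cost=4 * 1024 * 1024, default_parallelism=6):
    # Each file "costs" open_cost extra bytes, to discourage tiny partitions.
    total = sum(file_sizes) + open_cost * len(file_sizes)
    bytes_per_core = total / default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))
    partitions, current = 0, 0
    # Greedily pack files (largest first) until a partition would overflow.
    for size in sorted(file_sizes, reverse=True):
        padded = size + open_cost
        if current > 0 and current + padded > max_split:
            partitions += 1
            current = 0
        current += padded
    return partitions + (1 if current > 0 else 0)

# 13 tiny part-files with defaultParallelism = 6 pack two-per-partition:
print(num_read_partitions([1024] * 13))   # 7
```

Raising {{spark.sql.files.openCostInBytes}} or lowering {{spark.sql.files.maxPartitionBytes}} pushes this count back up toward the part-file count, which is consistent with the comment above that only {{spark.sql.files.maxPartitionBytes}} had an effect.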
[jira] [Commented] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326163#comment-16326163 ] Wenchen Fan commented on SPARK-23076:
-
How did you cache the ShuffledRowRDD? It's an intermediate, internal RDD created by Spark SQL; we don't expect end users to touch it.
> When we call cache() on ShuffleRowRDD, we will get an error result
> --
>
> Key: SPARK-23076
> URL: https://issues.apache.org/jira/browse/SPARK-23076
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: zhoukang
> Priority: Major
> Attachments: shufflerowrdd-cache.png
>
>
> For the query below:
> {code:java}
> select * from csv_demo limit 3;
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Joe      | 20   |
> | Tom      | 30   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> However, when we call cache() on the ShuffledRowRDD (or an RDD which depends on the ShuffledRowRDD in the same stage):
> !shufflerowrdd-cache.png!
> then the result is wrong:
> {code}
> 0: jdbc:hive2://xxx/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> The reason this happens is that the UnsafeRows generated by ShuffledRowRDD#compute reuse the same underlying byte buffer.
> I printed some logs to demonstrate this, after modifying UnsafeRow.toString():
> {code:java}
> // This is for debugging
> @Override
> public String toString() {
>   StringBuilder build = new StringBuilder("[");
>   for (int i = 0; i < sizeInBytes; i += 8) {
>     if (i != 0) build.append(',');
>     build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i)));
>   }
>   build.append("," + baseObject + "," + baseOffset + ']'); // Print baseObject and baseOffset here
>   return build.toString();
> }
> {code}
> {code:java}
> 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16]
> 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16]
> 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16]
> {code}
> We can fix this by adding a config, and copying each UnsafeRow read by the ShuffledRowRDD iterator when the config is true, like below:
> {code:java}
> reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => {
>   if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE)
>       && x._2.isInstanceOf[UnsafeRow]) {
>     x._2.asInstanceOf[UnsafeRow].copy()
>   } else {
>     x._2
>   }
> })
> {code}
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23077.
---
Resolution: Invalid
I'm not sure what you're reporting here, but this sounds like a question about your code. Start on the mailing list or StackOverflow. Jira is for clearly described problems or changes, and ideally, a proposed solution.
> Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
> --
>
> Key: SPARK-23077
> URL: https://issues.apache.org/jira/browse/SPARK-23077
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.2.0
> Reporter: Pravin Agrawal
> Priority: Minor
>
> Using Apache Spark 2.2 Structured Streaming, create a program which reads data from Kafka and writes it to Hive.
> Data arrives from a Kafka topic at 100 records/sec and should be written to a Hive table.
> *Hive table created:*
> {code:sql}
> CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC;
> {code}
> *Insert via manual Hive query:*
> {code:sql}
> INSERT INTO TABLE demo_user VALUES (1514133139123, 14, 26.4, 'pravin', true);
> {code}
> *Insert via Spark Structured Streaming code:*
> {code:java}
> SparkConf conf = new SparkConf();
> conf.setAppName("testing");
> conf.setMaster("local[2]");
> conf.set("hive.metastore.uris", "thrift://localhost:9083");
> SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate();
> // workaround START: code to insert static data into hive
> String insertQuery = "INSERT INTO TABLE demo_user VALUES (1514133139123, 14, 26.4, 'pravin', true)";
> session.sql(insertQuery);
> // workaround END
> // Solution START
> Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic
> // some code which writes dataset into hive table demo_user
> // Solution END
> {code}
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
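For reference, the pattern the reporter is after, draining a stream into periodic bulk writes, is what Structured Streaming's foreachBatch sink (added later, in Spark 2.4) provides: each trigger hands the sink a batch of rows that can be written with an ordinary batch writer. The sketch below is a plain-Python illustration of that micro-batch contract, not Spark code; `write_batch` is a hypothetical stand-in for a sink-specific writer such as a bulk INSERT into the ORC table above.

```python
# Toy micro-batch driver: group an unbounded record stream into fixed-size
# batches and hand each batch, with its batch id, to a sink writer. This
# mirrors the (batch DataFrame, batch id) contract of foreachBatch.
from itertools import islice

def for_each_batch(records, batch_size, write_batch):
    it = iter(records)
    batch_id = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        write_batch(batch_id, batch)   # e.g. one bulk INSERT per batch
        batch_id += 1

written = []
for_each_batch(range(10), batch_size=4,
               write_batch=lambda bid, rows: written.append((bid, rows)))
print(written)   # [(0, [0, 1, 2, 3]), (1, [4, 5, 6, 7]), (2, [8, 9])]
```

Batching the writes this way is also what keeps a 100-records/sec stream from producing one tiny ORC file per record.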
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Issue Type: Improvement (was: Bug) > Apache Structured Streaming: Bulk/Batch write support for Hive using > streaming dataset > -- > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, Create a program which reads > data from Kafka and write it to Hive. > Data incoming from Kafka topic @ 100 records/sec and write to Hive table. > **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23076) When we call cache() on ShuffleRowRDD, we will get an error result
[ https://issues.apache.org/jira/browse/SPARK-23076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326153#comment-16326153 ] Sean Owen commented on SPARK-23076:
---
That sounds serious, although if true, I would expect a lot of things to fail. Where are you calling cache(), by the way? Maybe CC [~cloud_fan]
> When we call cache() on ShuffleRowRDD, we will get an error result
> --
>
> Key: SPARK-23076
> URL: https://issues.apache.org/jira/browse/SPARK-23076
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: zhoukang
> Priority: Major
> Attachments: shufflerowrdd-cache.png
>
>
> For the query below:
> {code:java}
> select * from csv_demo limit 3;
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://10.108.230.228:1/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Joe      | 20   |
> | Tom      | 30   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> However, when we call cache() on the ShuffledRowRDD (or an RDD which depends on the ShuffledRowRDD in the same stage):
> !shufflerowrdd-cache.png!
> then the result is wrong:
> {code}
> 0: jdbc:hive2://xxx/> select * from csv_demo limit 3;
> +----------+------+
> |   _c0    | _c1  |
> +----------+------+
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> | Hyukjin  | 25   |
> +----------+------+
> {code}
> The reason this happens is that the UnsafeRows generated by ShuffledRowRDD#compute reuse the same underlying byte buffer.
> I printed some logs to demonstrate this, after modifying UnsafeRow.toString():
> {code:java}
> // This is for debugging
> @Override
> public String toString() {
>   StringBuilder build = new StringBuilder("[");
>   for (int i = 0; i < sizeInBytes; i += 8) {
>     if (i != 0) build.append(',');
>     build.append(java.lang.Long.toHexString(Platform.getLong(baseObject, baseOffset + i)));
>   }
>   build.append("," + baseObject + "," + baseOffset + ']'); // Print baseObject and baseOffset here
>   return build.toString();
> }
> {code}
> {code:java}
> 2018-01-12,22:08:47,441 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,656f4a,3032,[B@6225ec90,16]
> 2018-01-12,22:08:47,445 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180003,22,6d6f54,3033,[B@6225ec90,16]
> 2018-01-12,22:08:47,448 INFO org.apache.spark.sql.execution.ShuffledRowRDD: Read value: [0,180007,22,6e696a6b757948,3532,[B@6225ec90,16]
> {code}
> We can fix this by adding a config, and copying each UnsafeRow read by the ShuffledRowRDD iterator when the config is true, like below:
> {code:java}
> reader.read().asInstanceOf[Iterator[Product2[Int, InternalRow]]].map(x => {
>   if (SparkEnv.get.conf.get(StaticSQLConf.UNSAFEROWRDD_CACHE_ENABLE)
>       && x._2.isInstanceOf[UnsafeRow]) {
>     x._2.asInstanceOf[UnsafeRow].copy()
>   } else {
>     x._2
>   }
> })
> {code}
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes
[ https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326151#comment-16326151 ] Nick Pentreath commented on SPARK-22943: Does the new estimator & model version of OHE solve this underlying issue? > OneHotEncoder supports manual specification of categorySizes > > > Key: SPARK-22943 > URL: https://issues.apache.org/jira/browse/SPARK-22943 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > OHE should support configurable categorySizes, as n-values in > http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. > which allows consistent and foreseeable conversion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
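As background for why manually specified category sizes give the "consistent and foreseeable conversion" the issue asks for: when the one-hot width is inferred from the data, two datasets that happen to contain different observed categories encode to incompatible vector lengths. The sketch below is a hypothetical plain-Python `one_hot`, not the Spark ML or scikit-learn API, illustrating the point.

```python
# one_hot with an optional fixed size: if size is given the output width is
# stable; otherwise it is inferred from the categories actually observed.
def one_hot(index, size=None, seen=None):
    width = size if size is not None else max(seen) + 1   # inferred width
    vec = [0] * width
    vec[index] = 1
    return vec

batch_a = [0, 1]       # categories observed in batch A
batch_b = [0, 1, 2]    # batch B additionally sees category 2

# Inferred widths differ between batches, so the encodings are incompatible:
print(len(one_hot(1, seen=batch_a)))   # 2
print(len(one_hot(1, seen=batch_b)))   # 3

# With a manually specified size (the proposal here), both batches agree:
print(one_hot(1, size=3))   # [0, 1, 0]
```

This is the same role that the `n_values` parameter played in scikit-learn's OneHotEncoder, cited in the issue description.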
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Description: Using Apache Spark 2.2: Structured Streaming, Create a program which reads data from Kafka and write it to Hive. Data incoming from Kafka topic @ 100 records/sec and write to Hive table. **Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // some code which writes dataset into hive table demo_user // Solution END was: Using Apache Spark 2.2: Structured Streaming, I am creating a program which reads data from Kafka and write it to Hive. I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. 
**Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // some code which writes dataset into hive table demo_user // Solution END > Apache Structured Streaming: Bulk/Batch write support for Hive using > streaming dataset > -- > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, Create a program which reads > data from Kafka and write it to Hive. > Data incoming from Kafka topic @ 100 records/sec and write to Hive table. 
> **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Summary: Apache Structured Streaming: Bulk/Batch write support for Hive using streaming dataset (was: Apache Structured Streaming: Unable to write streaming dataset into Hive) > Apache Structured Streaming: Bulk/Batch write support for Hive using > streaming dataset > -- > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, I am creating a program which > reads data from Kafka and write it to Hive. > I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. > **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22993) checkpointInterval param doc should be clearer
[ https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22993. Resolution: Fixed > checkpointInterval param doc should be clearer > -- > > Key: SPARK-22993 > URL: https://issues.apache.org/jira/browse/SPARK-22993 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.3.0 >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Trivial > > several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS, > LDA, GBT), each of which silently ignores the parameter when the checkpoint > directory is not set on the spark context. This should be documented in the > param doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Agrawal updated SPARK-23077: --- Description: Using Apache Spark 2.2: Structured Streaming, I am creating a program which reads data from Kafka and write it to Hive. I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. **Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // some code which writes dataset into hive table demo_user // Solution END was: Using Apache Spark 2.2: Structured Streaming, I am creating a program which reads data from Kafka and write it to Hive. I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. 
**Hive Table Created:** CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; **Insert via Manual Hive Query:** INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); **Insert via spark structured streaming code:** SparkConf conf = new SparkConf(); conf.setAppName("testing"); conf.setMaster("local[2]"); conf.set("hive.metastore.uris", "thrift://localhost:9083"); SparkSession session = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); // workaround START: code to insert static data into hive String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true)"; session.sql(insertQuery); // workaround END: // Solution START Dataset dataset = readFromKafka(sparkSession); // private method reading data from Kafka's 'xyz' topic // **My question here:** // some code which writes dataset into hive table demo_user // Solution END > Apache Structured Streaming: Unable to write streaming dataset into Hive > > > Key: SPARK-23077 > URL: https://issues.apache.org/jira/browse/SPARK-23077 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Pravin Agrawal >Priority: Minor > > Using Apache Spark 2.2: Structured Streaming, I am creating a program which > reads data from Kafka and write it to Hive. > I am looking for writing bulk data incoming in Kafka topic @ 100 records/sec. 
> **Hive Table Created:** > CREATE TABLE demo_user( timeaa BIGINT, numberbb INT, decimalcc DOUBLE, > stringdd STRING, booleanee BOOLEAN ) STORED AS ORC ; > **Insert via Manual Hive Query:** > INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, 'pravin', true); > **Insert via spark structured streaming code:** > SparkConf conf = new SparkConf(); > conf.setAppName("testing"); > conf.setMaster("local[2]"); > conf.set("hive.metastore.uris", "thrift://localhost:9083"); > SparkSession session = > SparkSession.builder().config(conf).enableHiveSupport().getOrCreate(); > // workaround START: code to insert static data into hive > String insertQuery = "INSERT INTO TABLE demo_user (1514133139123, 14, 26.4, > 'pravin', true)"; > session.sql(insertQuery); > // workaround END: > // Solution START > Dataset dataset = readFromKafka(sparkSession); // private method > reading data from Kafka's 'xyz' topic > // some code which writes dataset into hive table demo_user > // Solution END -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22993) checkpointInterval param doc should be clearer
[ https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Pentreath reassigned SPARK-22993:
--------------------------------------
    Assignee: Seth Hendrickson

> checkpointInterval param doc should be clearer
> ----------------------------------------------
>
>                 Key: SPARK-22993
>                 URL: https://issues.apache.org/jira/browse/SPARK-22993
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Seth Hendrickson
>            Assignee: Seth Hendrickson
>            Priority: Trivial
>
> Several algorithms use the shared parameter {{HasCheckpointInterval}} (ALS,
> LDA, GBT), and each of them silently ignores the parameter when no checkpoint
> directory is set on the Spark context. This should be documented in the
> param doc.
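To make the pitfall in SPARK-22993 concrete: setting `checkpointInterval` on ALS has no effect unless a checkpoint directory is also set on the SparkContext. A minimal sketch, assuming a local session and a hypothetical ratings DataFrame (the checkpoint path is an example):

```java
import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.sql.SparkSession;

public class CheckpointIntervalDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("checkpoint-demo")
                .master("local[2]")
                .getOrCreate();

        // Without this line, the checkpointInterval below is silently
        // ignored: ALS (like LDA and GBT) only checkpoints when a
        // checkpoint directory is set on the SparkContext.
        spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints");

        ALS als = new ALS()
                .setMaxIter(20)
                .setCheckpointInterval(5) // truncate lineage every 5 iterations
                .setUserCol("userId")
                .setItemCol("movieId")
                .setRatingCol("rating");

        // als.fit(ratings) would now checkpoint intermediate RDDs every
        // 5 iterations, keeping the lineage (and stack depth) bounded.
        spark.stop();
    }
}
```

Checkpointing here trades some I/O for a bounded RDD lineage, which matters for iterative algorithms that would otherwise build a 20-deep dependency chain.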
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive?
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pravin Agrawal updated SPARK-23077:
-----------------------------------
    Issue Type: Bug  (was: Question)
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pravin Agrawal updated SPARK-23077:
-----------------------------------
    Summary: Apache Structured Streaming: Unable to write streaming dataset into Hive  (was: Apache Structured Streaming: Unable to write streaming dataset into Hive?)
[jira] [Updated] (SPARK-23077) Apache Structured Streaming: Unable to write streaming dataset into Hive?
[ https://issues.apache.org/jira/browse/SPARK-23077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pravin Agrawal updated SPARK-23077:
-----------------------------------
    Summary: Apache Structured Streaming: Unable to write streaming dataset into Hive?  (was: Apache Structured Streaming: How to write streaming dataset into Hive?)