[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-06-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504316#comment-16504316
 ] 

Hyukjin Kwon commented on SPARK-24442:
--

I meant: please avoid setting the "Fix Version/s" field when the JIRA is created.

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of records and the number of characters per column displayed when 
> a user runs the .show() function on a DataFrame are hard-coded. The current defaults are 
> too small when used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.
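A minimal sketch of the behavior being discussed, assuming an active SparkSession named spark (as in the PySpark shell). The two configuration keys above are only proposed in the attached diff; today the equivalent control is passed explicitly to .show():

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("show-truncation-demo").getOrCreate()
df = spark.createDataFrame([("a" * 50, 1), ("b" * 50, 2)], ["text", "id"])

df.show()                    # defaults: up to 20 rows, values truncated to 20 characters
df.show(n=50, truncate=100)  # wider console: more rows, up to 100 characters per column
{code}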



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24279) Incompatible byte code errors, when using test-jar of spark sql.

2018-06-06 Thread Prashant Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-24279.
-
Resolution: Invalid

> Incompatible byte code errors, when using test-jar of spark sql.
> 
>
> Key: SPARK-24279
> URL: https://issues.apache.org/jira/browse/SPARK-24279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Tests
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Using the test libraries available with Spark SQL streaming produces weird 
> incompatible byte code errors. It has already been tested on a different virtual 
> box instance to make sure that it is not related to my system environment. A 
> reproducer is uploaded to GitHub.
> [https://github.com/ScrapCodes/spark-bug-reproducer]
>  
> Doing a clean build reproduces the error.
>  
> Verbatim paste of the error.
> {code:java}
> [INFO] Compiling 1 source files to 
> /home/prashant/work/test/target/test-classes at 1526380360990
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'QueryTest.class'.
> [INFO] Could not access type PlanTest in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'QueryTest.class' was compiled against an 
> incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'SQLTestUtilsBase.class'.
> [INFO] Could not access type PlanTestBase in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'SQLTestUtilsBase.class' was compiled 
> against an incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'SQLTestUtils.class'.
> [INFO] Could not access type PlanTest in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'SQLTestUtils.class' was compiled against 
> an incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:25: 
> error: Unable to find encoder for type stored in a Dataset. Primitive types 
> (Int, String, etc) and Product types (case classes) are supported by 
> importing spark.implicits._ Support for serializing other types will be added 
> in future releases.
> [ERROR] val inputData = MemoryStream[Int]
> [ERROR] ^
> [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:30: 
> error: Unable to find encoder for type stored in a Dataset. Primitive types 
> (Int, String, etc) and Product types (case classes) are supported by 
> importing spark.implicits._ Support for serializing other types will be added 
> in future releases.
> [ERROR] CheckAnswer(2, 3, 4))
> [ERROR] ^
> [ERROR] 5 errors found
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24477:


Assignee: (was: Apache Spark)

> Import submodules under pyspark.ml by default
> -
>
> Key: SPARK-24477
> URL: https://issues.apache.org/jira/browse/SPARK-24477
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Right now, we do not import submodules under pyspark.ml by default. So users 
> cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I created this JIRA to discuss whether we should import the submodules by default. 
> It will change the behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it removes the need for explicit submodule imports.
> cc [~hyukjin.kwon]
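For context, a minimal sketch of what works today in PySpark versus what the proposal would allow (the proposed behavior is commented out because it requires this change):

{code:python}
# Works today: import the submodule or the class explicitly.
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1)

# Also works today:
import pyspark.ml.clustering as clustering
kmeans = clustering.KMeans(k=2, seed=1)

# Proposed: submodules reachable directly after importing the package.
# from pyspark import ml
# kmeans = ml.clustering.KMeans(k=2, seed=1)
{code}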



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24477:


Assignee: Apache Spark

> Import submodules under pyspark.ml by default
> -
>
> Key: SPARK-24477
> URL: https://issues.apache.org/jira/browse/SPARK-24477
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Major
>
> Right now, we do not import submodules under pyspark.ml by default. So users 
> cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I created this JIRA to discuss whether we should import the submodules by default. 
> It will change the behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it removes the need for explicit submodule imports.
> cc [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504247#comment-16504247
 ] 

Apache Spark commented on SPARK-24477:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21483

> Import submodules under pyspark.ml by default
> -
>
> Key: SPARK-24477
> URL: https://issues.apache.org/jira/browse/SPARK-24477
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Right now, we do not import submodules under pyspark.ml by default. So users 
> cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I created this JIRA to discuss whether we should import the submodules by default. 
> It will change the behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it removes the need for explicit submodule imports.
> cc [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24279) Incompatible byte code errors, when using test-jar of spark sql.

2018-06-06 Thread Prashant Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504246#comment-16504246
 ] 

Prashant Sharma commented on SPARK-24279:
-

Thanks a lot, that was the mistake. 

> Incompatible byte code errors, when using test-jar of spark sql.
> 
>
> Key: SPARK-24279
> URL: https://issues.apache.org/jira/browse/SPARK-24279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Tests
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Using the test libraries available with Spark SQL streaming produces weird 
> incompatible byte code errors. It has already been tested on a different virtual 
> box instance to make sure that it is not related to my system environment. A 
> reproducer is uploaded to GitHub.
> [https://github.com/ScrapCodes/spark-bug-reproducer]
>  
> Doing a clean build reproduces the error.
>  
> Verbatim paste of the error.
> {code:java}
> [INFO] Compiling 1 source files to 
> /home/prashant/work/test/target/test-classes at 1526380360990
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'QueryTest.class'.
> [INFO] Could not access type PlanTest in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'QueryTest.class' was compiled against an 
> incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'SQLTestUtilsBase.class'.
> [INFO] Could not access type PlanTestBase in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'SQLTestUtilsBase.class' was compiled 
> against an incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'SQLTestUtils.class'.
> [INFO] Could not access type PlanTest in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'SQLTestUtils.class' was compiled against 
> an incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:25: 
> error: Unable to find encoder for type stored in a Dataset. Primitive types 
> (Int, String, etc) and Product types (case classes) are supported by 
> importing spark.implicits._ Support for serializing other types will be added 
> in future releases.
> [ERROR] val inputData = MemoryStream[Int]
> [ERROR] ^
> [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:30: 
> error: Unable to find encoder for type stored in a Dataset. Primitive types 
> (Int, String, etc) and Product types (case classes) are supported by 
> importing spark.implicits._ Support for serializing other types will be added 
> in future releases.
> [ERROR] CheckAnswer(2, 3, 4))
> [ERROR] ^
> [ERROR] 5 errors found
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24475) Nested JSON count() Exception

2018-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24475.
--
Resolution: Duplicate

> Nested JSON count() Exception
> -
>
> Key: SPARK-24475
> URL: https://issues.apache.org/jira/browse/SPARK-24475
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph Toth
>Priority: Major
>
> I have a nested-structure JSON file with only 2 rows.
>  
> {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}}
> {{jsonData = spark.read.json(file)}}
> {{jsonData.count() will crash with the following exception; jsonData.head(10) 
> works.}}
>  
> Traceback (most recent call last):
>  File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 
> 2882, in run_code
>  exec(code_obj, self.user_global_ns, self.user_ns)
>  File "", line 1, in 
>  jsonData.count()
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line 
> 455, in count
>  return int(self._jdf.count())
>  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 
> 1160, in __call__
>  answer, self.gateway_client, self.target_id, self.name)
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, 
> in deco
>  return f(*a, **kw)
>  File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in 
> get_return_value
>  format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o411.count.
> : java.lang.IllegalArgumentException
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
>  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>  at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
>  at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
>  at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
>  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
>  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770)
>  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769)
>  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
>  at org.apache.spark.sql.Dataset.count(Dataset.scala:2769)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.b

[jira] [Commented] (SPARK-24475) Nested JSON count() Exception

2018-06-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504235#comment-16504235
 ] 

Hyukjin Kwon commented on SPARK-24475:
--

I don't think Spark currently supports Java 9 or 10 yet. Let's leave this as a 
duplicate of SPARK-24417.
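For reference, a minimal sketch (with a hypothetical Java 8 path) of pinning PySpark to Java 8 from a plain Python process; the Spark launcher picks up JAVA_HOME when it starts the JVM:

{code:python}
import os

# Hypothetical install location; adjust to your Java 8 JDK/JRE path.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JSONRead").getOrCreate()
{code}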

> Nested JSON count() Exception
> -
>
> Key: SPARK-24475
> URL: https://issues.apache.org/jira/browse/SPARK-24475
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph Toth
>Priority: Major
>
> I have a nested-structure JSON file with only 2 rows.
>  
> {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}}
> {{jsonData = spark.read.json(file)}}
> {{jsonData.count() will crash with the following exception; jsonData.head(10) 
> works.}}
>  
> Traceback (most recent call last):
>  File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 
> 2882, in run_code
>  exec(code_obj, self.user_global_ns, self.user_ns)
>  File "", line 1, in 
>  jsonData.count()
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line 
> 455, in count
>  return int(self._jdf.count())
>  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 
> 1160, in __call__
>  answer, self.gateway_client, self.target_id, self.name)
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, 
> in deco
>  return f(*a, **kw)
>  File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in 
> get_return_value
>  format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o411.count.
> : java.lang.IllegalArgumentException
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
>  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>  at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
>  at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
>  at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
>  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
>  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770)
>  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769)
>  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
>  at org.apache.spark.sql.Dataset.count(Dataset.scala:2769)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Nati

[jira] [Comment Edited] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-06-06 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504213#comment-16504213
 ] 

Liang-Chi Hsieh edited comment on SPARK-24447 at 6/7/18 4:09 AM:
-

I just built Spark from current 2.3 branch. The above example code also works 
well on it.


was (Author: viirya):
I just build Spark from current 2.3 branch. The above example code also works 
well.

> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Major
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context. 
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #
> print(sims.entries.context) # PySparkShell>, then throws an error{code}
> Error stack trace
> {code:java}
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
> partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-06-06 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504213#comment-16504213
 ] 

Liang-Chi Hsieh commented on SPARK-24447:
-

I just build Spark from current 2.3 branch. The above example code also works 
well.

> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Major
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context. 
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #
> print(sims.entries.context) # PySparkShell>, then throws an error{code}
> Error stack trace
> {code:java}
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
> partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-06 Thread Xinyong Tian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504202#comment-16504202
 ] 

Xinyong Tian commented on SPARK-24431:
--

I also feel it is reasonable to set the first point as (0, p). In fact, as long as it 
is not (0, 1), areaUnderPR will be small enough for a model that predicts the same p 
for all examples, so cross-validation will not select such a model.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
>  evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, no 
> variable will be selected (no model), i.e. every row's prediction is the same, e.g. 
> equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator sets the areaUnderPR highest. As a result, the best 
> model selected is a no-model.
> The reason is the following: with no model, the precision-recall curve has 
> only two points: at recall=0, precision should be set to zero, while the 
> software sets it to 1; at recall=1, precision is the event rate. As a result, 
> the areaUnderPR will be close to 0.5 (my event rate is very low), which is the 
> maximum.
> The solution is to set precision=0 when recall=0.
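A minimal numeric sketch of the two-point curve described above, assuming a hypothetical event rate p = 0.01 and a constant-score model; it only illustrates the trapezoidal area under the different conventions for the recall=0 point, not Spark's actual implementation:

{code:python}
p = 0.01  # event rate (baseline frequency), hypothetical value

def two_point_pr_area(precision_at_recall_0, event_rate):
    # Trapezoidal area between (recall=0, precision_at_recall_0) and (recall=1, event_rate).
    return (precision_at_recall_0 + event_rate) / 2.0

print(two_point_pr_area(1.0, p))  # current behavior: ~0.505, close to the maximum
print(two_point_pr_area(p, p))    # first point at (0, p): 0.01
print(two_point_pr_area(0.0, p))  # proposed precision=0 at recall=0: 0.005
{code}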



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-06 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504171#comment-16504171
 ] 

Li Yuanjian commented on SPARK-24375:
-

Got it, many thanks for your detailed explanation.

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24475) Nested JSON count() Exception

2018-06-06 Thread Joseph Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504163#comment-16504163
 ] 

Joseph Toth commented on SPARK-24475:
-

It looks like it was my version of Java on this machine. It was the Java 9 JRE. I 
upgraded to the Java 10 JDK, but received the error at the bottom. Then I downgraded 
to the Java 8 JDK and it worked. Thanks a bunch for getting back to me!
 
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-2-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
 
 
❯ java --version
openjdk 10.0.1 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Debian-4)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Debian-4, mixed mode)
 
❯ spark-shell --master 'local[2]'
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-06-06 22:08:16 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
 
Failed to initialize compiler: object java.lang.Object in compiler mirror not 
found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programmatically, settings.usejavacp.value = true.
 
Failed to initialize compiler: object java.lang.Object in compiler mirror not 
found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programmatically, settings.usejavacp.value = true.
Exception in thread "main" java.lang.NullPointerException
        at 
scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
        at 
scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:896)
        at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:895)
        at 
scala.tools.nsc.interpreter.IMain$Request.headerPreamble$lzycompute(IMain.scala:895)
        at 
scala.tools.nsc.interpreter.IMain$Request.headerPreamble(IMain.scala:895)
        at 
scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:918)
        at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1337)
        at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1336)
        at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64)
        at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1336)
        at 
scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:908)
        at 
scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:1002)
        at scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997)
        at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)

> Nested JSON count() Exception
> -
>
> Key: SPARK-24475
> URL: https://issues.apache.org/jira/browse/SPARK-24475
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph Toth
>Priority: Major
>
> I have a nested-structure JSON file with only 2 rows.
>  
> {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}}
> {{jsonData = spark.read.json(file)}}
> {{jsonData.count() will crash with the following exception; jsonData.head(10) 
> works.}}
>  
> Traceback (most recent call last):
>  File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 
> 2882, in run_code
>  exec(code_obj, self.user_global_ns, self.user_ns)
>  File "", line 1, in 
>  jsonData.count()
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line 
> 455, in count
>  return int(self._jdf.count())
>  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 
> 1160, in __call__
>  answer, self.gateway_client, self.target_id, self.name)
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, 
> in deco
>  return f(*a, **kw)
>  File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in 
> get_return_value
>  format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o411.count.
> : java.lang.IllegalArgumentException
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at 
> org.apache.spark.util.ClosureCleaner$.

[jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-06-06 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504152#comment-16504152
 ] 

Liang-Chi Hsieh commented on SPARK-24447:
-

Yes, I can run the example code on a build from the latest source code.

Because I can't reproduce it with your provided example on the latest codebase, I'd 
like to know whether it is a problem in the Spark codebase or not. Since you said the 
affected version is 2.3.0, I will try it on the 2.3.0 codebase to see if there is a 
bug.

> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Major
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context. 
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #
> print(sims.entries.context) # PySparkShell>, then throws an error{code}
> Error stack trace
> {code:java}
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
> partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24475) Nested JSON count() Exception

2018-06-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504134#comment-16504134
 ] 

Hyukjin Kwon commented on SPARK-24475:
--

I can't reproduce this. What's your version and environment? Do other JSON files work 
fine?

> Nested JSON count() Exception
> -
>
> Key: SPARK-24475
> URL: https://issues.apache.org/jira/browse/SPARK-24475
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Joseph Toth
>Priority: Major
>
> I have a nested-structure JSON file with only 2 rows.
>  
> {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}}
> {{jsonData = spark.read.json(file)}}
> {{jsonData.count() will crash with the following exception; jsonData.head(10) 
> works.}}
>  
> Traceback (most recent call last):
>  File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 
> 2882, in run_code
>  exec(code_obj, self.user_global_ns, self.user_ns)
>  File "", line 1, in 
>  jsonData.count()
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line 
> 455, in count
>  return int(self._jdf.count())
>  File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 
> 1160, in __call__
>  answer, self.gateway_client, self.target_id, self.name)
>  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, 
> in deco
>  return f(*a, **kw)
>  File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in 
> get_return_value
>  format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o411.count.
> : java.lang.IllegalArgumentException
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.(Unknown Source)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
>  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>  at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
>  at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
>  at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>  at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
>  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
>  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770)
>  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769)
>  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
>  at org.apache.spark.sql.Dataset.count(Dataset.scala:2769)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at

[jira] [Updated] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24479:
-
Affects Version/s: 2.4.0

> Register StreamingQueryListener in Spark Conf 
> --
>
> Key: SPARK-24479
> URL: https://issues.apache.org/jira/browse/SPARK-24479
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Mingjie Tang
>Priority: Major
>  Labels: feature
>
> Users need to register their own StreamingQueryListener into 
> StreamingQueryManager; similar functionality is already provided via EXTRA_LISTENERS 
> and QUERY_EXECUTION_LISTENERS. 
> We propose to provide a STREAMING_QUERY_LISTENER conf for users to register 
> their own listeners. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24480) Add a config to register custom StreamingQueryListeners

2018-06-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504121#comment-16504121
 ] 

Hyukjin Kwon commented on SPARK-24480:
--

(let's avoid setting the fix version, which is usually set when the issue is actually 
fixed)

> Add a config to register custom StreamingQueryListeners
> ---
>
> Key: SPARK-24480
> URL: https://issues.apache.org/jira/browse/SPARK-24480
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Minor
>
> Currently a "StreamingQueryListener" can only be registered programatically. 
> We could have a new config "spark.sql.streamingQueryListeners" similar to  
> "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to 
> register custom streaming listeners.
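A minimal sketch of how the proposed config would be used, assuming the key above is eventually supported; the listener class name is hypothetical and would have to be on the driver classpath:

{code:python}
from pyspark.sql import SparkSession

# "spark.sql.streamingQueryListeners" is the key proposed here and is not assumed
# to exist in released Spark; "com.example.MyStreamingQueryListener" is a placeholder.
spark = (SparkSession.builder
         .appName("streaming-listener-via-conf")
         .config("spark.sql.streamingQueryListeners",
                 "com.example.MyStreamingQueryListener")
         .getOrCreate())
{code}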



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24480) Add a config to register custom StreamingQueryListeners

2018-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24480:
-
Fix Version/s: (was: 2.4.0)

> Add a config to register custom StreamingQueryListeners
> ---
>
> Key: SPARK-24480
> URL: https://issues.apache.org/jira/browse/SPARK-24480
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Minor
>
> Currently a "StreamingQueryListener" can only be registered programatically. 
> We could have a new config "spark.sql.streamingQueryListeners" similar to  
> "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to 
> register custom streaming listeners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-06 Thread Arun Mahadevan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504120#comment-16504120
 ] 

Arun Mahadevan commented on SPARK-24479:


PR raised - https://github.com/apache/spark/pull/21504

> Register StreamingQueryListener in Spark Conf 
> --
>
> Key: SPARK-24479
> URL: https://issues.apache.org/jira/browse/SPARK-24479
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Priority: Major
>  Labels: feature
>
> Users need to register their own StreamingQueryListener into 
> StreamingQueryManager; similar functionality is already provided via EXTRA_LISTENERS 
> and QUERY_EXECUTION_LISTENERS. 
> We propose to provide a STREAMING_QUERY_LISTENER conf for users to register 
> their own listeners. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24479:


Assignee: Apache Spark

> Register StreamingQueryListener in Spark Conf 
> --
>
> Key: SPARK-24479
> URL: https://issues.apache.org/jira/browse/SPARK-24479
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Assignee: Apache Spark
>Priority: Major
>  Labels: feature
>
> Users need to register their own StreamingQueryListener into 
> StreamingQueryManager; similar functionality is already provided via EXTRA_LISTENERS 
> and QUERY_EXECUTION_LISTENERS. 
> We propose to provide a STREAMING_QUERY_LISTENER conf for users to register 
> their own listeners. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504119#comment-16504119
 ] 

Apache Spark commented on SPARK-24479:
--

User 'arunmahadevan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21504

> Register StreamingQueryListener in Spark Conf 
> --
>
> Key: SPARK-24479
> URL: https://issues.apache.org/jira/browse/SPARK-24479
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Priority: Major
>  Labels: feature
>
> Users need to register their own StreamingQueryListener into 
> StreamingQueryManager; similar functionality is already provided via EXTRA_LISTENERS 
> and QUERY_EXECUTION_LISTENERS. 
> We propose to provide a STREAMING_QUERY_LISTENER conf for users to register 
> their own listeners. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24479:


Assignee: (was: Apache Spark)

> Register StreamingQueryListener in Spark Conf 
> --
>
> Key: SPARK-24479
> URL: https://issues.apache.org/jira/browse/SPARK-24479
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Priority: Major
>  Labels: feature
>
> Users need to register their own StreamingQueryListener into 
> StreamingQueryManager; similar functionality is already provided via EXTRA_LISTENERS 
> and QUERY_EXECUTION_LISTENERS. 
> We propose to provide a STREAMING_QUERY_LISTENER conf for users to register 
> their own listeners. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24480) Add a config to register custom StreamingQueryListeners

2018-06-06 Thread Arun Mahadevan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Mahadevan resolved SPARK-24480.

Resolution: Duplicate

The issue was already raised; duplicate of SPARK-24479.

> Add a config to register custom StreamingQueryListeners
> ---
>
> Key: SPARK-24480
> URL: https://issues.apache.org/jira/browse/SPARK-24480
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently a "StreamingQueryListener" can only be registered programatically. 
> We could have a new config "spark.sql.streamingQueryListeners" similar to  
> "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to 
> register custom streaming listeners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors.  Happens with large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
 

Exception:
{code:java}
18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1392)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$3.apply(SparkPlan.scala:167)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:164)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:61)
at 
org.apache.

[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors. Happens with a large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
 

Exception:
{code:java}
18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method 
"project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB{code}
 

Log file is attached

  was:
Similar to other "grows beyond 64 KB" errors. Happens with a large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached


> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors. Happens with a large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://";,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_n

[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Environment: Emr 5.13.0 and Databricks Cloud 4.0  (was: Emr 5.13.0)

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors. Happens with a large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://";,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors. Happens with a large case 
statement:
{code:java}
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached

  was:
Similar to other "grows beyond 64 KB" errors. Happens with a large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached


> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors. Happens with a large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://";,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Description: 
Similar to other "grows beyond 64 KB" errors. Happens with a large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300){
  a = a + s"when action like '$i%' THEN '$i' "
}
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached

  was:
Similar to other "grows beyond 64 KB" errors. Happens with a large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300)
a = a + s"when action like '$i%' THEN '$i' "
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached


> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors. Happens with a large case 
> statement:
> {code:java}
> // Databricks notebook source
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://";,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--

[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Conegliano updated SPARK-24481:
--
Attachment: log4j-active(1).log

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors. Happens with a large case 
> statement:
> {code:java}
> // Databricks notebook source
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://";,
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300)
> a = a + s"when action like '$i%' THEN '$i' "
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
> Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24480) Add a config to register custom StreamingQueryListeners

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24480:


Assignee: (was: Apache Spark)

> Add a config to register custom StreamingQueryListeners
> ---
>
> Key: SPARK-24480
> URL: https://issues.apache.org/jira/browse/SPARK-24480
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently a "StreamingQueryListener" can only be registered programmatically. 
> We could have a new config "spark.sql.streamingQueryListeners" similar to  
> "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to 
> register custom streaming listeners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24480) Add a config to register custom StreamingQueryListeners

2018-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504113#comment-16504113
 ] 

Apache Spark commented on SPARK-24480:
--

User 'arunmahadevan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21504

> Add a config to register custom StreamingQueryListeners
> ---
>
> Key: SPARK-24480
> URL: https://issues.apache.org/jira/browse/SPARK-24480
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently a "StreamingQueryListener" can only be registered programmatically. 
> We could have a new config "spark.sql.streamingQueryListeners" similar to  
> "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to 
> register custom streaming listeners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24480) Add a config to register custom StreamingQueryListeners

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24480:


Assignee: Apache Spark

> Add a config to register custom StreamingQueryListeners
> ---
>
> Key: SPARK-24480
> URL: https://issues.apache.org/jira/browse/SPARK-24480
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently a "StreamingQueryListener" can only be registered programmatically. 
> We could have a new config "spark.sql.streamingQueryListeners" similar to  
> "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to 
> register custom streaming listeners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-06 Thread Andrew Conegliano (JIRA)
Andrew Conegliano created SPARK-24481:
-

 Summary: GeneratedIteratorForCodegenStage1 grows beyond 64 KB
 Key: SPARK-24481
 URL: https://issues.apache.org/jira/browse/SPARK-24481
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: Emr 5.13.0
Reporter: Andrew Conegliano
 Attachments: log4j-active(1).log

Similar to other "grows beyond 64 KB" errors. Happens with a large case 
statement:
{code:java}
// Databricks notebook source
import org.apache.spark.sql.functions._
import scala.collection.mutable
import org.apache.spark.sql.Column

var rdd = sc.parallelize(Array("""{
"event":
{
"timestamp": 1521086591110,
"event_name": "yu",
"page":
{
"page_url": "https://";,
"page_name": "es"
},
"properties":
{
"id": "87",
"action": "action",
"navigate_action": "navigate_action"
}
}
}
"""))

var df = spark.read.json(rdd)
df = 
df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
.toDF("id","event_time","url","action","page_name","event_name","navigation_action")

var a = "case "
for(i <- 1 to 300)
a = a + s"when action like '$i%' THEN '$i' "
a = a + " else null end as task_id"

val expression = expr(a)

df = df.filter("id is not null and id <> '' and event_time is not null")

val transformationExpressions: mutable.HashMap[String, Column] = 
mutable.HashMap(
"action" -> expr("coalesce(action, navigation_action) as action"),
"task_id" -> expression
)

for((col, expr) <- transformationExpressions)
df = df.withColumn(col, expr)

df = df.filter("(action is not null and action <> '') or (page_name is not null 
and page_name <> '')")

df.show
{code}
Log file is attached



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24480) Add a config to register custom StreamingQueryListeners

2018-06-06 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-24480:
--

 Summary: Add a config to register custom StreamingQueryListeners
 Key: SPARK-24480
 URL: https://issues.apache.org/jira/browse/SPARK-24480
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Arun Mahadevan
 Fix For: 2.4.0


Currently a "StreamingQueryListener" can only be registered programmatically. We 
could have a new config "spark.sql.streamingQueryListeners" similar to  
"spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to 
register custom streaming listeners.
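
For context, here is a minimal sketch (not from this JIRA) of the programmatic 
registration that such a config would make optional, using the existing 
StreamingQueryListener API:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object ListenerRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("listener-sketch").getOrCreate()

    // Today a listener can only be attached in code, on an already-running session.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: ${event.id}")
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(s"input rows: ${event.progress.numInputRows}")
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: ${event.id}")
    })
  }
}
{code}
With the proposed config, the same class could instead be named by its fully 
qualified class name at session creation, mirroring how "spark.extraListeners" 
is used.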



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-06 Thread Mingjie Tang (JIRA)
Mingjie Tang created SPARK-24479:


 Summary: Register StreamingQueryListener in Spark Conf 
 Key: SPARK-24479
 URL: https://issues.apache.org/jira/browse/SPARK-24479
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Mingjie Tang


Users need to register their own StreamingQueryListener with the 
StreamingQueryManager; a similar facility already exists as EXTRA_LISTENERS and 
QUERY_EXECUTION_LISTENERS. 

We propose to provide a STREAMING_QUERY_LISTENER conf for users to register their 
own listeners. 
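
As a point of reference, EXTRA_LISTENERS and QUERY_EXECUTION_LISTENERS are used 
like this today; the listener class names below are hypothetical placeholders 
that would have to exist on the classpath (a sketch, not part of this proposal):
{code:java}
import org.apache.spark.sql.SparkSession

// Existing precedent: listeners are named by fully qualified class name in
// configuration and instantiated when the session starts.
val spark = SparkSession.builder()
  .appName("listener-conf-sketch")
  .config("spark.extraListeners", "com.example.MyAppListener")
  .config("spark.sql.queryExecutionListeners", "com.example.MyQueryExecutionListener")
  .getOrCreate()
{code}
The proposed STREAMING_QUERY_LISTENER conf would follow the same pattern for 
StreamingQueryListener implementations.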



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24478:


Assignee: (was: Apache Spark)

> DataSourceV2 should push filters and projection at physical plan conversion
> ---
>
> Key: SPARK-24478
> URL: https://issues.apache.org/jira/browse/SPARK-24478
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 currently pushes filters and projected columns in the optimized 
> plan, but this requires creating and configuring a reader multiple times and 
> prevents the v2 relation's output from being a fixed argument of the relation 
> case class. It is also much cleaner (see PR 
> [#21262|https://github.com/apache/spark/pull/21262]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion

2018-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503874#comment-16503874
 ] 

Apache Spark commented on SPARK-24478:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21503

> DataSourceV2 should push filters and projection at physical plan conversion
> ---
>
> Key: SPARK-24478
> URL: https://issues.apache.org/jira/browse/SPARK-24478
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 currently pushes filters and projected columns in the optimized 
> plan, but this requires creating and configuring a reader multiple times and 
> prevents the v2 relation's output from being a fixed argument of the relation 
> case class. It is also much cleaner (see PR 
> [#21262|https://github.com/apache/spark/pull/21262]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24478:


Assignee: Apache Spark

> DataSourceV2 should push filters and projection at physical plan conversion
> ---
>
> Key: SPARK-24478
> URL: https://issues.apache.org/jira/browse/SPARK-24478
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> DataSourceV2 currently pushes filters and projected columns in the optimized 
> plan, but this requires creating and configuring a reader multiple times and 
> prevents the v2 relation's output from being a fixed argument of the relation 
> case class. It is also much cleaner (see PR 
> [#21262|https://github.com/apache/spark/pull/21262]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2018-06-06 Thread Joel Croteau (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503835#comment-16503835
 ] 

Joel Croteau commented on SPARK-24357:
--

[~viirya], yes, that's what I said. What I am saying is that when an integer 
that is too large to fit in a long occurs, the schema should either be inferred 
as a type that can hold it, or there should be a run-time error.

> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> -
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Major
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return Null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult to debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion

2018-06-06 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-24478:
-

 Summary: DataSourceV2 should push filters and projection at 
physical plan conversion
 Key: SPARK-24478
 URL: https://issues.apache.org/jira/browse/SPARK-24478
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue


DataSourceV2 currently pushes filters and projected columns in the optimized 
plan, but this requires creating and configuring a reader multiple times and 
prevents the v2 relation's output from being a fixed argument of the relation 
case class. It is also much cleaner (see PR 
[#21262|https://github.com/apache/spark/pull/21262]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24469) Support collations in Spark SQL

2018-06-06 Thread Alexander Shkapsky (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503723#comment-16503723
 ] 

Alexander Shkapsky commented on SPARK-24469:


*first* or *min* on a StringType column gets planned with SortAggregateExec due 
to the StringType aggregate result, so the same performance problems apply.

> Support collations in Spark SQL
> ---
>
> Key: SPARK-24469
> URL: https://issues.apache.org/jira/browse/SPARK-24469
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Shkapsky
>Priority: Major
>
> One of our use cases is to support case-insensitive comparison in operations, 
> including aggregation and text comparison filters.  Another use case is to 
> sort via collator.  Support for collations throughout the query processor 
> appears to be the proper way to support these needs.
> Language-based workarounds (for the aggregation case) are insufficient:
>  # SELECT UPPER(text) ... GROUP BY UPPER(text)
> introduces invalid values into the output set
>  # SELECT MIN(text)...GROUP BY UPPER(text) 
> results in poor performance in our case, in part due to use of sort-based 
> aggregate
> Examples of collation support in RDBMS:
>  * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html]
>  * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html]
>  * 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html]
>  * [SQL 
> Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017]
>  * 
> [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html]
>  
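
For reference, a runnable spark-shell sketch (not from this thread) of the two 
language-based workarounds quoted above, which makes the problem with #1 
concrete:
{code:java}
// Assumes a spark-shell or notebook session where `spark` is the SparkSession.
import spark.implicits._

Seq("George Washington", "george washington").toDF("text").createOrReplaceTempView("t")

// Workaround 1: groups case-insensitively, but the returned value
// "GEORGE WASHINGTON" does not appear anywhere in the input.
spark.sql("SELECT UPPER(text) AS text, COUNT(*) AS cnt FROM t GROUP BY UPPER(text)").show()

// Workaround 2: returns a real input value per group, but the StringType
// aggregate result gets planned with a sort-based aggregate, hence the
// performance problem described in the comments.
spark.sql("SELECT MIN(text) AS text, COUNT(*) AS cnt FROM t GROUP BY UPPER(text)").show()
{code}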



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503710#comment-16503710
 ] 

Hyukjin Kwon commented on SPARK-24477:
--

Thanks for cc'ing me, [~mengxr]. Will do this too tomorrow.

> Import submodules under pyspark.ml by default
> -
>
> Key: SPARK-24477
> URL: https://issues.apache.org/jira/browse/SPARK-24477
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Right now, we do not import submodules under pyspark.ml by default. So users 
> cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I create this JIRA to discuss if we should import the submodules by default. 
> It will change behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it simplifies unnecessary imports.
> cc [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24477:
--
Description: 
Right now, we do not import submodules under pyspark.ml by default. So users 
cannot do

{code}
from pyspark import ml
kmeans = ml.clustering.KMeans(...)
{code}

I create this JIRA to discuss if we should import the submodules by default. It 
will change behavior of

{code}
from pyspark.ml import *
{code}

But it simplifies unnecessary imports.

cc [~hyukjin.kwon]

  was:
Right now, we do not import submodules under pyspark.ml by default. So users 
cannot do

{code}
from pyspark import ml
kmeans = ml.clustering.KMeans(...)
{code}

I create this JIRA to discuss if we should import the submodules by default. It 
will change behavior of

{code}
from pyspark.ml import *
{code}

cc [~hyukjin.kwon]


> Import submodules under pyspark.ml by default
> -
>
> Key: SPARK-24477
> URL: https://issues.apache.org/jira/browse/SPARK-24477
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Right now, we do not import submodules under pyspark.ml by default. So users 
> cannot do
> {code}
> from pyspark import ml
> kmeans = ml.clustering.KMeans(...)
> {code}
> I create this JIRA to discuss if we should import the submodules by default. 
> It will change behavior of
> {code}
> from pyspark.ml import *
> {code}
> But it simplifies unnecessary imports.
> cc [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined

2018-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24454:
--
Issue Type: Improvement  (was: Bug)

> ml.image doesn't have __all__ explicitly defined
> 
>
> Key: SPARK-24454
> URL: https://issues.apache.org/jira/browse/SPARK-24454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> ml/image.py doesn't have __all__ explicitly defined. It will import all 
> global names by default (only ImageSchema for now), which is not a good 
> practice. We should add __all__ to image.py.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24454) ml.image doesn't have __all__ explicitly defined

2018-06-06 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503700#comment-16503700
 ] 

Xiangrui Meng commented on SPARK-24454:
---

Updated this JIRA and created SPARK-24477.

> ml.image doesn't have __all__ explicitly defined
> 
>
> Key: SPARK-24454
> URL: https://issues.apache.org/jira/browse/SPARK-24454
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> ml/image.py doesn't have __all__ explicitly defined. It will import all 
> global names by default (only ImageSchema for now), which is not a good 
> practice. We should add __all__ to image.py.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined

2018-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24454:
--
Priority: Minor  (was: Major)

> ml.image doesn't have __all__ explicitly defined
> 
>
> Key: SPARK-24454
> URL: https://issues.apache.org/jira/browse/SPARK-24454
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> ml/image.py doesn't have __all__ explicitly defined. It will import all 
> global names by default (only ImageSchema for now), which is not a good 
> practice. We should add __all__ to image.py.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24477) Import submodules under pyspark.ml by default

2018-06-06 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-24477:
-

 Summary: Import submodules under pyspark.ml by default
 Key: SPARK-24477
 URL: https://issues.apache.org/jira/browse/SPARK-24477
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


Right now, we do not import submodules under pyspark.ml by default. So users 
cannot do

{code}
from pyspark import ml
kmeans = ml.clustering.KMeans(...)
{code}

I create this JIRA to discuss if we should import the submodules by default. It 
will change behavior of

{code}
from pyspark.ml import *
{code}

cc [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined

2018-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24454:
--
Description: ml/image.py doesn't have __all__ explicitly defined. It will 
import all global names by default (only ImageSchema for now), which is not a 
good practice. We should add __all__ to image.py.  (was: {code:java}
from pyspark import ml
ml.image{code}

The ml.image line will fail because we don't have __all__ defined there. )

> ml.image doesn't have __all__ explicitly defined
> 
>
> Key: SPARK-24454
> URL: https://issues.apache.org/jira/browse/SPARK-24454
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> ml/image.py doesn't have __all__ explicitly defined. It will import all 
> global names by default (only ImageSchema for now), which is not a good 
> practice. We should add __all__ to image.py.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined

2018-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24454:
--
Summary: ml.image doesn't have __all__ explicitly defined  (was: ml.image 
doesn't have default imports)

> ml.image doesn't have __all__ explicitly defined
> 
>
> Key: SPARK-24454
> URL: https://issues.apache.org/jira/browse/SPARK-24454
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> {code:java}
> from pyspark import ml
> ml.image{code}
> The ml.image line will fail because we don't have __all__ defined there. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24469) Support collations in Spark SQL

2018-06-06 Thread Eric Maynard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503687#comment-16503687
 ] 

Eric Maynard commented on SPARK-24469:
--

Ah, I see, I was wrongly thinking of the second case where you use e.g. MIN to 
get some legitimate input value. But I can see how *min* would yield bad 
performance. 
Maybe try *first* instead?

> Support collations in Spark SQL
> ---
>
> Key: SPARK-24469
> URL: https://issues.apache.org/jira/browse/SPARK-24469
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Shkapsky
>Priority: Major
>
> One of our use cases is to support case-insensitive comparison in operations, 
> including aggregation and text comparison filters.  Another use case is to 
> sort via collator.  Support for collations throughout the query processor 
> appears to be the proper way to support these needs.
> Language-based workarounds (for the aggregation case) are insufficient:
>  # SELECT UPPER(text) ... GROUP BY UPPER(text)
> introduces invalid values into the output set
>  # SELECT MIN(text)...GROUP BY UPPER(text) 
> results in poor performance in our case, in part due to use of sort-based 
> aggregate
> Examples of collation support in RDBMS:
>  * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html]
>  * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html]
>  * 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html]
>  * [SQL 
> Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017]
>  * 
> [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context

2018-06-06 Thread Perry Chu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503678#comment-16503678
 ] 

Perry Chu commented on SPARK-24447:
---

Do you mean building from source code? I haven't done that before, but I can 
try. If you can suggest a good guide to follow, I would appreciate it.

> Pyspark RowMatrix.columnSimilarities() loses spark context
> --
>
> Key: SPARK-24447
> URL: https://issues.apache.org/jira/browse/SPARK-24447
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Perry Chu
>Priority: Major
>
> The RDD behind the CoordinateMatrix returned by 
> RowMatrix.columnSimilarities() appears to be losing track of the spark 
> context. 
> I'm pretty new to spark - not sure if the problem is on the python side or 
> the scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #
> print(sims.entries.context) # PySparkShell>, then throws an error{code}
> Error stack trace
> {code:java}
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
> partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-06 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-24466:
-
Target Version/s: 2.4.0

> TextSocketMicroBatchReader no longer works with nc utility
> --
>
> Key: SPARK-24466
> URL: https://issues.apache.org/jira/browse/SPARK-24466
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits 
> before reading the actual data, so the query also exits with an error.
>  
> The reason is that a temporary reader is launched just to read the schema and 
> is then closed before the real reader is opened. While a reliable socket 
> server should be able to handle this without any issue, the nc command 
> normally can't handle multiple connections and simply exits when the 
> temporary reader is closed.
>  
> Given that the socket source is expected to be used from the examples in the 
> official documentation or for quick experiments, where we tend to simply use 
> netcat, this is better treated as a bug, even though it is really a 
> limitation of netcat.
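
For context, the typical documentation-style setup being described here is 
roughly the following (a sketch, not from the report); the temporary 
schema-inference connection is what makes a plain netcat session exit early:
{code:java}
// Terminal 1:  nc -lk 9999
// Terminal 2:  spark-shell, then run the usual quick-start query.
// On 2.4.0-SNAPSHOT the source first opens and closes a temporary connection
// to read the schema, which nc treats as the end of its single session.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .format("console")
  .start()

query.awaitTermination()
{code}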



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-06 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503647#comment-16503647
 ] 

Jiang Xingbo commented on SPARK-24375:
--

The major problem is that tasks in the same stage of an MPI workload may rely 
on the internal results of other tasks running in parallel to compute the final 
results. Thus, when a task fails, other tasks in the same stage may generate 
incorrect results or even hang, so it seems straightforward to just retry the 
whole stage on task failure.

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24279) Incompatible byte code errors, when using test-jar of spark sql.

2018-06-06 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503630#comment-16503630
 ] 

Shixiong Zhu commented on SPARK-24279:
--

Just did a quick look at your pom.xml. I think it's missing the following 
catalyst test jar.
{code:xml}
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-catalyst_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
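
If the reproducer were built with sbt rather than Maven, the equivalent 
dependencies would look roughly like this (a sketch under that assumption, 
using the "tests" classifier that Maven test-jars are published with):
{code}
// build.sbt
val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"      % sparkVersion % "test",
  "org.apache.spark" %% "spark-sql"      % sparkVersion % "test" classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % "test" classifier "tests"
)
{code}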

> Incompatible byte code errors, when using test-jar of spark sql.
> 
>
> Key: SPARK-24279
> URL: https://issues.apache.org/jira/browse/SPARK-24279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Tests
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Using test libraries available with spark sql streaming, produces weird 
> incompatible byte code errors. It is already tested on a different virtual 
> box instance to make sure, that it is not related to my system environment. A 
> reproducer is uploaded to github.
> [https://github.com/ScrapCodes/spark-bug-reproducer]
>  
> Doing a clean build reproduces the error.
>  
> Verbatim paste of the error.
> {code:java}
> [INFO] Compiling 1 source files to 
> /home/prashant/work/test/target/test-classes at 1526380360990
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'QueryTest.class'.
> [INFO] Could not access type PlanTest in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'QueryTest.class' was compiled against an 
> incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'SQLTestUtilsBase.class'.
> [INFO] Could not access type PlanTestBase in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'SQLTestUtilsBase.class' was compiled 
> against an incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] error: missing or invalid dependency detected while loading class 
> file 'SQLTestUtils.class'.
> [INFO] Could not access type PlanTest in package 
> org.apache.spark.sql.catalyst.plans,
> [INFO] because it (or its dependencies) are missing. Check your build 
> definition for
> [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to 
> see the problematic classpath.)
> [INFO] A full rebuild may help if 'SQLTestUtils.class' was compiled against 
> an incompatible version of org.apache.spark.sql.catalyst.plans.
> [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:25: 
> error: Unable to find encoder for type stored in a Dataset. Primitive types 
> (Int, String, etc) and Product types (case classes) are supported by 
> importing spark.implicits._ Support for serializing other types will be added 
> in future releases.
> [ERROR] val inputData = MemoryStream[Int]
> [ERROR] ^
> [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:30: 
> error: Unable to find encoder for type stored in a Dataset. Primitive types 
> (Int, String, etc) and Product types (case classes) are supported by 
> importing spark.implicits._ Support for serializing other types will be added 
> in future releases.
> [ERROR] CheckAnswer(2, 3, 4))
> [ERROR] ^
> [ERROR] 5 errors found
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24469) Support collations in Spark SQL

2018-06-06 Thread Alexander Shkapsky (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503621#comment-16503621
 ] 

Alexander Shkapsky commented on SPARK-24469:


A simple case: with a single input value "George Washington", UPPER("George 
Washington") will produce a group "GEORGE WASHINGTON", and "GEORGE WASHINGTON" is 
not present in the input.

> Support collations in Spark SQL
> ---
>
> Key: SPARK-24469
> URL: https://issues.apache.org/jira/browse/SPARK-24469
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Shkapsky
>Priority: Major
>
> One of our use cases is to support case-insensitive comparison in operations, 
> including aggregation and text comparison filters.  Another use case is to 
> sort via collator.  Support for collations throughout the query processor 
> appears to be the proper way to support these needs.
> Language-based workarounds (for the aggregation case) are insufficient:
>  # SELECT UPPER(text)...GROUP BY UPPER(text)
> introduces invalid values into the output set
>  # SELECT MIN(text)...GROUP BY UPPER(text) 
> results in poor performance in our case, in part due to use of sort-based 
> aggregate
> Examples of collation support in RDBMS:
>  * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html]
>  * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html]
>  * 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html]
>  * [SQL 
> Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017]
>  * 
> [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24469) Support collations in Spark SQL

2018-06-06 Thread Eric Maynard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503586#comment-16503586
 ] 

Eric Maynard commented on SPARK-24469:
--

bq. SELECT UPPER(text)GROUP BY UPPER(text)
bq. introduces invalid values into the output set

Can you elaborate on this?

> Support collations in Spark SQL
> ---
>
> Key: SPARK-24469
> URL: https://issues.apache.org/jira/browse/SPARK-24469
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Shkapsky
>Priority: Major
>
> One of our use cases is to support case-insensitive comparison in operations, 
> including aggregation and text comparison filters.  Another use case is to 
> sort via collator.  Support for collations throughout the query processor 
> appears to be the proper way to support these needs.
> Language-based workarounds (for the aggregation case) are insufficient:
>  # SELECT UPPER(text)...GROUP BY UPPER(text)
> introduces invalid values into the output set
>  # SELECT MIN(text)...GROUP BY UPPER(text) 
> results in poor performance in our case, in part due to use of sort-based 
> aggregate
> Examples of collation support in RDBMS:
>  * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html]
>  * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html]
>  * 
> [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html]
>  * [SQL 
> Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017]
>  * 
> [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0

2018-06-06 Thread bharath kumar avusherla (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla updated SPARK-24476:

Description: 
We are working on a Spark streaming application using Spark Structured Streaming 
with checkpointing in S3. When we start the application, it runs just fine for 
some time, then it crashes with the error mentioned below. The amount of time it 
runs successfully varies: sometimes it runs for 2 days without any issues before 
crashing, sometimes it crashes after 4 hrs / 24 hrs.

Our streaming application joins (left and inner) multiple sources from Kafka as 
well as S3 and an Aurora database.

Can you please let us know how to solve this problem?
Is it possible to somehow tweak the socket timeout?
Here, I'm pasting a few lines of the complete exception log below. The complete 
exception is also attached to the issue.

*_Exception:_*

*_Caused by: java.net.SocketTimeoutException: Read timed out_*

        _at java.net.SocketInputStream.socketRead0(Native Method)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:150)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:121)_

        _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_

        _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_

        _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_

        _at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_

 

  was:
We are working on a Spark streaming application using Spark Structured Streaming 
with checkpointing in S3. When we start the application, it runs just fine for 
some time, then it crashes with the error mentioned below. The amount of time it 
runs successfully varies: sometimes it runs for 2 days without any issues before 
crashing, sometimes it crashes after 4 hrs / 24 hrs.

Our streaming application joins (left and inner) multiple sources from Kafka as 
well as S3 and an Aurora database.

Can you please let us know how to solve this problem? Is it possible to 
increase the timeout period?

Here, I'm pasting a few lines of the complete exception log below. The complete 
exception is also attached to the issue.

*_Exception:_*

*_Caused by: java.net.SocketTimeoutException: Read timed out_*

        _at java.net.SocketInputStream.socketRead0(Native Method)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:150)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:121)_

        _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_

        _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_

        _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_

        _at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_

 


> java.net.SocketTimeoutException: Read timed out Exception while running the 
> Spark Structured Streaming in 2.3.0
> ---
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Major
> Attachments: socket-timeout-exception
>
>
> We are working on a Spark streaming application using Spark Structured 
> Streaming with checkpointing in S3. When we start the application, it runs just 
> fine for some time, then it crashes with the error mentioned below. The amount 
> of time it runs successfully varies: sometimes it runs for 2 days without any 
> issues before crashing, sometimes it crashes after 4 hrs / 24 hrs. 
> Our streaming application joins (left and inner) multiple sources from Kafka as 
> well as S3 and an Aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the socket timeout? 
> Here, I'm pasting a few lines of 

[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0

2018-06-06 Thread bharath kumar avusherla (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla updated SPARK-24476:

Description: 
We are working on a Spark streaming application using Spark Structured Streaming 
with checkpointing in S3. When we start the application, it runs just fine for 
some time, then it crashes with the error mentioned below. The amount of time it 
runs successfully varies: sometimes it runs for 2 days without any issues before 
crashing, sometimes it crashes after 4 hrs / 24 hrs.

Our streaming application joins (left and inner) multiple sources from Kafka as 
well as S3 and an Aurora database.

Can you please let us know how to solve this problem? Is it possible to 
increase the timeout period?

Here, I'm pasting a few lines of the complete exception log below. The complete 
exception is also attached to the issue.

*_Exception:_*

*_Caused by: java.net.SocketTimeoutException: Read timed out_*

        _at java.net.SocketInputStream.socketRead0(Native Method)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:150)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:121)_

        _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_

        _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_

        _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_

        _at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_

 

  was:
We are working on a Spark streaming application using Spark Structured Streaming 
with checkpointing in S3. When we start the application, it runs just fine for 
some time, then it crashes with the error mentioned below. The amount of time it 
runs successfully varies: sometimes it runs for 2 days without any issues before 
crashing, sometimes it crashes after 4 hrs / 24 hrs.

Our streaming application joins (left and inner) multiple sources from Kafka as 
well as S3 and an Aurora database.

Can you please let us know how to solve this problem? Is it possible to 
increase the timeout period?

Here, I'm pasting the complete exception log below.

*_Exception:_*

*_Caused by: java.net.SocketTimeoutException: Read timed out_*

        _at java.net.SocketInputStream.socketRead0(Native Method)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:150)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:121)_

        _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_

        _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_

        _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_

        _at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_

        _at 
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:179)_

        _at 
org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)_

        _at 
org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134)_

        _at 
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:612)_

        _at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:447)_

        _at 
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)_

        _at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)_

        _at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)_

        _at 
org.jets3t.service.im

[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0

2018-06-06 Thread bharath kumar avusherla (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla updated SPARK-24476:

Attachment: socket-timeout-exception

> java.net.SocketTimeoutException: Read timed out Exception while running the 
> Spark Structured Streaming in 2.3.0
> ---
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Major
> Attachments: socket-timeout-exception
>
>
> We are working on a Spark streaming application using Spark Structured 
> Streaming with checkpointing in S3. When we start the application, it runs just 
> fine for some time, then it crashes with the error mentioned below. The amount 
> of time it runs successfully varies: sometimes it runs for 2 days without any 
> issues before crashing, sometimes it crashes after 4 hrs / 24 hrs. 
> Our streaming application joins (left and inner) multiple sources from Kafka as 
> well as S3 and an Aurora database.
> Can you please let us know how to solve this problem? Is it possible to 
> increase the timeout period?
> Here, I'm pasting the complete exception log below.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>         _at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:179)_
>         _at 
> org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)_
>         _at 
> org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134)_
>         _at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:612)_
>         _at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:447)_
>         _at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)_
>         _at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)_
>         _at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)_
>         _at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)_
>         _at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)_
>         _at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)_
>         _at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)_
>         _at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)_
>         _at 
> org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)_
>         _at 
> org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)_
>         _at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)_
>         _at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)_
>         _at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)_
>         _at java.lang.reflect.Method.invoke(Method.java:483)_
>         _at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)_
>         _at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)_
>         _at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)_
>         _at 
> org.apache.hadoop.io.retry.Retr

[jira] [Created] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0

2018-06-06 Thread bharath kumar avusherla (JIRA)
bharath kumar avusherla created SPARK-24476:
---

 Summary: java.net.SocketTimeoutException: Read timed out Exception 
while running the Spark Structured Streaming in 2.3.0
 Key: SPARK-24476
 URL: https://issues.apache.org/jira/browse/SPARK-24476
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: bharath kumar avusherla


We are working on a Spark streaming application using Spark Structured Streaming 
with checkpointing in S3. When we start the application, it runs just fine for 
some time, then it crashes with the error mentioned below. The amount of time it 
runs successfully varies: sometimes it runs for 2 days without any issues before 
crashing, sometimes it crashes after 4 hrs / 24 hrs.

Our streaming application joins (left and inner) multiple sources from Kafka as 
well as S3 and an Aurora database.

Can you please let us know how to solve this problem? Is it possible to 
increase the timeout period?

Here, I'm pasting the complete exception log below.

*_Exception:_*

*_Caused by: java.net.SocketTimeoutException: Read timed out_*

        _at java.net.SocketInputStream.socketRead0(Native Method)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:150)_

        _at java.net.SocketInputStream.read(SocketInputStream.java:121)_

        _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_

        _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_

        _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_

        _at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_

        _at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_

        _at 
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_

        _at 
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:179)_

        _at 
org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)_

        _at 
org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134)_

        _at 
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:612)_

        _at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:447)_

        _at 
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)_

        _at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)_

        _at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)_

        _at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)_

        _at 
org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)_

        _at 
org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)_

        _at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)_

        _at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)_

        _at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)_

        _at java.lang.reflect.Method.invoke(Method.java:483)_

        _at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)_

        _at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)_

        _at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)_

        _at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)_

        _at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)_

        _at org.apache.hadoop.fs.s3native.$Proxy18.retrieveMetadata(Unknown 
Source)_

        _at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:493)_

        _at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)_

        _at 
org

[jira] [Created] (SPARK-24475) Nested JSON count() Exception

2018-06-06 Thread Joseph Toth (JIRA)
Joseph Toth created SPARK-24475:
---

 Summary: Nested JSON count() Exception
 Key: SPARK-24475
 URL: https://issues.apache.org/jira/browse/SPARK-24475
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Joseph Toth


I have a nested-structure JSON file with only 2 rows.

{{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}}
{{jsonData = spark.read.json(file)}}

{{jsonData.count()}} will crash with the following exception; {{jsonData.head(10)}} 
works.

 

Traceback (most recent call last):
 File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 
2882, in run_code
 exec(code_obj, self.user_global_ns, self.user_ns)
 File "", line 1, in 
 jsonData.count()
 File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line 
455, in count
 return int(self._jdf.count())
 File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1160, 
in __call__
 answer, self.gateway_client, self.target_id, self.name)
 File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, 
in deco
 return f(*a, **kw)
 File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in 
get_return_value
 format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o411.count.
: java.lang.IllegalArgumentException
 at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
 at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
 at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
 at 
org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
 at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
 at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
 at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
 at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
 at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
 at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
 at 
org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
 at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
 at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
 at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
 at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
 at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
 at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
 at scala.collection.immutable.List.foreach(List.scala:381)
 at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
 at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
 at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770)
 at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769)
 at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
 at org.apache.spark.sql.Dataset.count(Dataset.scala:2769)
 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.base/java.lang.reflect.Method.invoke(Method.java:564)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
 at py4j.Gateway.invoke(Gateway.java:282)
 at py4j.commands.AbstractCommand.in

[jira] [Comment Edited] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-06 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503486#comment-16503486
 ] 

Li Yuanjian edited comment on SPARK-24375 at 6/6/18 3:55 PM:
-

Hi [~cloud_fan] and [~jiangxb1987], just a tiny question here: I noticed that the 
discussion in SPARK-20928 mentioned the barrier scheduling that Continuous 
Processing will depend on.
{quote}
A barrier stage doesn’t launch any of its tasks until the available slots(free 
CPU cores can be used to launch pending tasks) satisfies the target to launch 
all the tasks at the same time, and always retry the whole stage when any 
task(s) fail.
{quote}
Why is task-level retry forbidden here, and is it possible to achieve it? Thanks.
 

 


was (Author: xuanyuan):
Hi [~cloud_fan] and [~jiangxb1987], just a tiny question here: I noticed that the 
discussion in SPARK-20928 mentioned barrier scheduling.
{quote}
A barrier stage doesn’t launch any of its tasks until the available slots(free 
CPU cores can be used to launch pending tasks) satisfies the target to launch 
all the tasks at the same time, and always retry the whole stage when any 
task(s) fail.
{quote}
Why is task-level retry forbidden here, and is it possible to achieve it? Thanks.
 

 

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark

2018-06-06 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503486#comment-16503486
 ] 

Li Yuanjian commented on SPARK-24375:
-

Hi [~cloud_fan] and [~jiangxb1987], just a tiny question here: I noticed that the 
discussion in SPARK-20928 mentioned barrier scheduling.
{quote}
A barrier stage doesn’t launch any of its tasks until the available slots(free 
CPU cores can be used to launch pending tasks) satisfies the target to launch 
all the tasks at the same time, and always retry the whole stage when any 
task(s) fail.
{quote}
Why is task-level retry forbidden here, and is it possible to achieve it? Thanks.
 

 

> Design sketch: support barrier scheduling in Apache Spark
> -
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503466#comment-16503466
 ] 

Sean Owen commented on SPARK-24431:
---

So, the model makes the same prediction p for all examples? In that case, for 
points on the PR curve corresponding to thresholds less than p, everything is 
classified positive. Recall is 1 and precision is the fraction of examples that 
are actually positive. For thresholds greater than p, all is marked negative. 
Recall is 0, but, really precision is undefined (0/0). Yes, see SPARK-21806 for 
a discussion of the different ways you could try to handle this – no point for 
recall = 0, or precision = 0, 1, or p.

We (mostly I) settled on the latter as the least surprising change and one most 
likely to produce model comparisons that make intuitive sense. My argument 
against precision = 0 at recall = 0 is that it doesn't quite make sense that 
precision drops as recall decreases, and that would define precision as the 
smallest possible value.

You're right that this is a corner case, and there is no really correct way to 
handle it. I'd say the issue here is perhaps leaning too much on the area under 
the PR curve. It doesn't have as much meaning as the area under the ROC curve. 
I think it's just the extreme example of the problem with comparing precision 
and recall?
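
To make the corner case concrete, here is a minimal spark-shell sketch of my own 
(not from this ticket): a degenerate "model" that gives every example the same 
score, so the PR curve collapses to the two end points discussed above.

{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// 1 positive out of 4 examples, all scored with the same constant p.
val p = 0.3
val scoreAndLabels = sc.parallelize(Seq((p, 1.0), (p, 0.0), (p, 0.0), (p, 0.0)))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)

// Print the (recall, precision) points to see how the recall = 0 end is handled,
// and how much of areaUnderPR comes from that single corner.
metrics.pr().collect().foreach(println)
println(metrics.areaUnderPR())
{code}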

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
> evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, no 
> variable will be selected (no model), i.e. every row's prediction is the same, 
> e.g. equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator sets the areaUnderPR highest. As a result, the 
> best model selected is a no-model. 
> The reason is the following: with no model, the precision-recall curve has only 
> two points: at recall = 0, precision should be set to zero, while the software 
> sets it to 1; at recall = 1, precision is the event rate. As a result, the 
> areaUnderPR will be close to 0.5 (my event rate is very low), which is the 
> maximum. 
> The solution is to set precision = 0 when recall = 0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException

2018-06-06 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503445#comment-16503445
 ] 

Dongjoon Hyun commented on SPARK-24472:
---

Thank you for pinging me, [~zsxwing]

> Orc RecordReaderFactory throws IndexOutOfBoundsException
> 
>
> Key: SPARK-24472
> URL: https://issues.apache.org/jira/browse/SPARK-24472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When the column number of the underlying file schema is greater than the 
> column number of the table schema, Orc RecordReaderFactory will throw 
> IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be 
> turned off to use HiveTableScanExec. Here is a reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> Seq(("abc", 123, 123L)).toDF("s", "i", 
> "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest")
> spark.sql("""
> CREATE EXTERNAL TABLE orctest(s string)
> PARTITIONED BY (i int)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION '/tmp/orctest'
> """)
> spark.sql("msck repair table orctest")
> spark.sql("set spark.sql.hive.convertMetastoreOrc=false")
> // Exiting paste mode, now interpreting.
> 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.read.format("orc").load("/tmp/orctest").show()
> +---+---+---+
> |  s|  l|  i|
> +---+---+---+
> |abc|123|123|
> +---+---+---+
> scala> spark.sql("select * from orctest").show()
> 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.IndexOutOfBoundsException: toIndex = 2
>   at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
>   at java.util.ArrayList.subList(ArrayList.java:996)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   a

[jira] [Updated] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException

2018-06-06 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24472:
--
Affects Version/s: 1.6.3
   2.0.2
   2.1.2
   2.2.1

> Orc RecordReaderFactory throws IndexOutOfBoundsException
> 
>
> Key: SPARK-24472
> URL: https://issues.apache.org/jira/browse/SPARK-24472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When the column number of the underlying file schema is greater than the 
> column number of the table schema, Orc RecordReaderFactory will throw 
> IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be 
> turned off to use HiveTableScanExec. Here is a reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> Seq(("abc", 123, 123L)).toDF("s", "i", 
> "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest")
> spark.sql("""
> CREATE EXTERNAL TABLE orctest(s string)
> PARTITIONED BY (i int)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION '/tmp/orctest'
> """)
> spark.sql("msck repair table orctest")
> spark.sql("set spark.sql.hive.convertMetastoreOrc=false")
> // Exiting paste mode, now interpreting.
> 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.read.format("orc").load("/tmp/orctest").show()
> +---+---+---+
> |  s|  l|  i|
> +---+---+---+
> |abc|123|123|
> +---+---+---+
> scala> spark.sql("select * from orctest").show()
> 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.IndexOutOfBoundsException: toIndex = 2
>   at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
>   at java.util.ArrayList.subList(ArrayList.java:996)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD

[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2018-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503394#comment-16503394
 ] 

Apache Spark commented on SPARK-22575:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21502

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22575) Making Spark Thrift Server clean up its cache

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22575:


Assignee: (was: Apache Spark)

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22575) Making Spark Thrift Server clean up its cache

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22575:


Assignee: Apache Spark

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Assignee: Apache Spark
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-06-06 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23803:
---

Assignee: Asher Saban

> Support bucket pruning to optimize filtering on a bucketed column
> -
>
> Key: SPARK-23803
> URL: https://issues.apache.org/jira/browse/SPARK-23803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Asher Saban
>Assignee: Asher Saban
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> support bucket pruning when filtering on a single bucketed column on the 
> following predicates - 
>  # EqualTo
>  # EqualNullSafe
>  # In
>  # (1)-(3) combined in And/Or predicates
>  
> based on [~smilegator]'s work in SPARK-12850 which was removed from the code 
> base. 
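
For illustration, a small spark-shell sketch of my own (not from the pull 
request; table and column names are made up): a table bucketed by {{key}} and a 
filter of the kinds listed above that bucket pruning can restrict to a subset of 
the buckets.

{code}
// Write a table bucketed by `key`, then filter on the bucketed column with the
// EqualTo / In / Or predicate shapes named above.
spark.range(0, 1000000)
  .selectExpr("id", "cast(id % 100 as int) as key")
  .write
  .bucketBy(8, "key")
  .sortBy("key")
  .saveAsTable("bucketed_events")

// With bucket pruning, only the buckets that can contain key = 7, 11 or 13 need
// to be scanned instead of all 8 buckets.
spark.table("bucketed_events")
  .where("key = 7 OR key IN (11, 13)")
  .count()
{code}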



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-06-06 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23803.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20915
[https://github.com/apache/spark/pull/20915]

> Support bucket pruning to optimize filtering on a bucketed column
> -
>
> Key: SPARK-23803
> URL: https://issues.apache.org/jira/browse/SPARK-23803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Asher Saban
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> support bucket pruning when filtering on a single bucketed column on the 
> following predicates - 
>  # EqualTo
>  # EqualNullSafe
>  # In
>  # (1)-(3) combined in And/Or predicates
>  
> based on [~smilegator]'s work in SPARK-12850 which was removed from the code 
> base. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2018-06-06 Thread Kyle Prifogle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503256#comment-16503256
 ] 

Kyle Prifogle edited comment on SPARK-15882 at 6/6/18 1:00 PM:
---

[~josephkb]  Noticing this is almost 2 years old now, which gives me the 
impression that this isn't going to be done?  If I took the time to start to 
bite this off, does anyone think that they would use it, or are most people 
finding their own more "expert" solutions :D ?  Seems like matrix 
representations and basic algebra are pretty fundamental.


was (Author: kprifogle1):
Noticing this is almost 2 years old now, which gives me the impression that this 
isn't going to be done?  If I took the time to start to bite this off, does 
anyone think that they would use it, or are most people finding their own more 
"expert" solutions :D ?  Seems like matrix representations and basic algebra are 
pretty fundamental.

> Discuss distributed linear algebra in spark.ml package
> --
>
> Key: SPARK-15882
> URL: https://issues.apache.org/jira/browse/SPARK-15882
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2018-06-06 Thread Kyle Prifogle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503256#comment-16503256
 ] 

Kyle Prifogle commented on SPARK-15882:
---

Noticing this is almost 2 years old now, which gives me the impression that this 
isn't going to be done?  If I took the time to start to bite this off, does 
anyone think that they would use it, or are most people finding their own more 
"expert" solutions :D ?  Seems like matrix representations and basic algebra are 
pretty fundamental.

> Discuss distributed linear algebra in spark.ml package
> --
>
> Key: SPARK-15882
> URL: https://issues.apache.org/jira/browse/SPARK-15882
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24471) MlLib distributed plans

2018-06-06 Thread Kyle Prifogle (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Prifogle resolved SPARK-24471.
---
Resolution: Duplicate

> MlLib distributed plans
> ---
>
> Key: SPARK-24471
> URL: https://issues.apache.org/jira/browse/SPARK-24471
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Kyle Prifogle
>Priority: Major
>
>  
> I have found myself using MlLib's CoordinateMatrix and RowMatrix a lot lately. 
>  Since the new API is centered on Ml.linalg and MlLib is in maintenance mode, 
> are there plans to move all the matrix components over to Ml.linalg?  I don't 
> see a distributed package in the new one yet.
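
For reference, a minimal sketch of the two RDD-based distributed matrix types 
mentioned above, which currently live only under 
org.apache.spark.mllib.linalg.distributed (assuming an {{sc}} SparkContext as in 
spark-shell):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}

// RowMatrix: rows stored as an RDD of local vectors.
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val rowMat = new RowMatrix(rows)
val gram = rowMat.computeGramianMatrix()   // small local (driver-side) result

// CoordinateMatrix: sparse (i, j, value) entries, convertible to other layouts.
val entries = sc.parallelize(Seq(MatrixEntry(0L, 1L, 2.0), MatrixEntry(1L, 0L, 3.0)))
val coordMat = new CoordinateMatrix(entries)
val asRows = coordMat.toRowMatrix()
{code}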



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24471) MlLib distributed plans

2018-06-06 Thread Kyle Prifogle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503251#comment-16503251
 ] 

Kyle Prifogle commented on SPARK-24471:
---

This is what I was looking for:

https://issues.apache.org/jira/browse/SPARK-15882

Also, please remove the "Question" option from Jira if it's not to be used.

> MlLib distributed plans
> ---
>
> Key: SPARK-24471
> URL: https://issues.apache.org/jira/browse/SPARK-24471
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Kyle Prifogle
>Priority: Major
>
>  
> I have found myself using MlLib's CoordinateMatrix and RowMatrix a lot lately. 
>  Since the new API is centered on Ml.linalg and MlLib is in maintenance mode, 
> are there plans to move all the matrix components over to Ml.linalg?  I don't 
> see a distributed package in the new one yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Description: 
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled
 * The stage that hangs (referred to as "Second stage" above) has a lower 
'Stage Id' than the first one that completes

  was:
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled


> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes

[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503225#comment-16503225
 ] 

Al M commented on SPARK-24474:
--

I appreciate that 2.2.0 is slightly old, but I couldn't see any scheduler fixes 
in later versions that sounded like this.

> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Description: 
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled

  was:
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled


> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator

2018-06-06 Thread Teng Peng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503217#comment-16503217
 ] 

Teng Peng commented on SPARK-24431:
---

[~Ben2018] The article makes sense to me. It seems the current behavior follows 
"Case 2: TP is not 0", but it sets precision = 1 if recall = 0. (See 
[https://github.com/apache/spark/blob/734ed7a7b397578f16549070f350215bde369b3c/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L110]
 )

Have you checked out SPARK-21806 and its discussion on JIRA? 

Let's hear [~srowen]'s opinion.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---
>
> Key: SPARK-24431
> URL: https://issues.apache.org/jira/browse/SPARK-24431
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Xinyong Tian
>Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
>  evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, 
> no variable is selected (no model), i.e. every row's prediction is the same, 
> e.g. equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator gives the areaUnderPR its highest value. As a 
> result, the best model selected is no model at all. 
> The reason is the following. With no model, the precision-recall curve has 
> only two points: at recall = 0, precision should be set to zero, while the 
> software sets it to 1; at recall = 1, precision is the event rate. As a 
> result, the areaUnderPR will be close to 0.5 (my event rate is very low), 
> which is the maximum. 
> The solution is to set precision = 0 when recall = 0.
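
To make the arithmetic concrete, a quick sketch in plain Scala (not the Spark 
API) of the two-point curve described above, assuming a low event rate of 2%:

{code:scala}
val eventRate = 0.02                                        // assumed baseline positive rate
// (recall, precision) points of the degenerate curve for a constant-score "model":
val withPrecisionOne  = Seq((0.0, 1.0), (1.0, eventRate))   // current behavior
val withPrecisionZero = Seq((0.0, 0.0), (1.0, eventRate))   // proposed behavior

// Trapezoid rule over consecutive (recall, precision) points.
def area(curve: Seq[(Double, Double)]): Double =
  curve.sliding(2).map { case Seq((x1, y1), (x2, y2)) => (x2 - x1) * (y1 + y2) / 2 }.sum

area(withPrecisionOne)    // 0.51 -- near the maximum, so the "no model" wins model selection
area(withPrecisionZero)   // 0.01 -- correctly penalized
{code}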



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)
Al M created SPARK-24474:


 Summary: Cores are left idle when there are a lot of stages to run
 Key: SPARK-24474
 URL: https://issues.apache.org/jira/browse/SPARK-24474
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Al M


I've observed an issue happening consistently when:
* A job contains a join of two datasets
* One dataset is much larger than the other
* Both datasets require some processing before they are joined

What I have observed is:
* 2 stages are initially active to run processing on the two datasets
** These stages are run in parallel
** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
** Spark allocates a similar (though not exactly equal) number of cores to each 
stage
* First stage completes (for the smaller dataset)
** Now there is only one stage running
** It still has many tasks left (usually > 20k tasks)
** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
** This continues until the second stage completes
* Second stage completes, and third begins (the stage that actually joins the 
data)
** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
* It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to other stages
* I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
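
A minimal sketch of the shape of job described above (paths, column names, and 
the processing steps are illustrative only; assumes a {{spark}} session):

{code:scala}
import org.apache.spark.sql.functions.expr

// Smaller input: its processing stage finishes first.
val small = spark.read.parquet("/data/small")        // hypothetical path
  .groupBy("key").count()

// Larger input: its processing stage keeps running while roughly half the cores sit idle.
val large = spark.read.parquet("/data/large")        // hypothetical path
  .withColumn("bucket", expr("hash(key) % 100"))

// Final stage: the join itself, which uses all cores again once both inputs are ready.
large.join(small, "key").write.parquet("/data/out")  // hypothetical output path
{code}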



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Description: 
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled

  was:
I've observed an issue happening consistently when:
* A job contains a join of two datasets
* One dataset is much larger than the other
* Both datasets require some processing before they are joined

What I have observed is:
* 2 stages are initially active to run processing on the two datasets
** These stages are run in parallel
** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
** Spark allocates a similar (though not exactly equal) number of cores to each 
stage
* First stage completes (for the smaller dataset)
** Now there is only one stage running
** It still has many tasks left (usually > 20k tasks)
** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
** This continues until the second stage completes
* Second stage completes, and third begins (the stage that actually joins the 
data)
** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
* It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to other stages
* I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled


> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503150#comment-16503150
 ] 

Apache Spark commented on SPARK-21687:
--

User 'debugger87' has created a pull request for this issue:
https://github.com/apache/spark/pull/18900

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create partitions for a partitioned table. `createTime` is an important piece 
> of information for managing the data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition`, as shown below:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}
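
For illustration, a minimal sketch (not the actual patch) of the kind of call 
the issue asks for, assuming the usual metastore convention of creation time in 
whole seconds since the epoch:

{code:scala}
// Hypothetical addition inside toHivePartition, alongside the other tpart setters:
tpart.setCreateTime((System.currentTimeMillis() / 1000).toInt)
{code}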



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21687:


Assignee: (was: Apache Spark)

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create partitions for a partitioned table. `createTime` is an important piece 
> of information for managing the data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition`, as shown below:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21687) Spark SQL should set createTime for Hive partition

2018-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21687:


Assignee: Apache Spark

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Assignee: Apache Spark
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create partitions for a partitioned table. `createTime` is an important piece 
> of information for managing the data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition`, as shown below:
> {code:scala}
> def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> val tpart = new org.apache.hadoop.hive.metastore.api.Partition
> val partValues = ht.getPartCols.asScala.map { hc =>
>   p.spec.get(hc.getName).getOrElse {
> throw new IllegalArgumentException(
>   s"Partition spec is missing a value for column '${hc.getName}': 
> ${p.spec}")
>   }
> }
> val storageDesc = new StorageDescriptor
> val serdeInfo = new SerDeInfo
> 
> p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
> p.storage.inputFormat.foreach(storageDesc.setInputFormat)
> p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
> p.storage.serde.foreach(serdeInfo.setSerializationLib)
> serdeInfo.setParameters(p.storage.properties.asJava)
> storageDesc.setSerdeInfo(serdeInfo)
> tpart.setDbName(ht.getDbName)
> tpart.setTableName(ht.getTableName)
> tpart.setValues(partValues.asJava)
> tpart.setSd(storageDesc)
> new HivePartition(ht, tpart)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2018-06-06 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503065#comment-16503065
 ] 

Liang-Chi Hsieh edited comment on SPARK-24357 at 6/6/18 9:50 AM:
-

I think this is because this number {{1 << 65}} (36893488147419103232L) exceeds 
the range of Scala's Long type.

Scala:
scala> Long.MaxValue
res3: Long = 9223372036854775807

Python:
>>> TEST_DATA = [Row(data=9223372036854775807L)]
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=9223372036854775807)] 
>>> TEST_DATA = [Row(data=9223372036854775808L)]
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=None)]



was (Author: viirya):
I think this is because this number {{1 << 65}} (36893488147419103232L) exceeds 
the range of Scala's Long type.

>>> TEST_DATA = [Row(data=9223372036854775807L)]
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=9223372036854775807)] 
>>> TEST_DATA = [Row(data=9223372036854775808L)]
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=None)]


> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> -
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Major
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return Null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult to debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.
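
For reference, a minimal sketch (Scala API, assuming a {{spark}} session as in 
spark-shell) showing that values beyond the 64-bit range can be carried 
explicitly as DecimalType, which is roughly the representation an improved 
_infer_type would have to pick:

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val big = BigDecimal(BigInt(1) << 65)   // 36893488147419103232, larger than Long.MaxValue
val schema = StructType(Seq(StructField("data", DecimalType(38, 0))))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(big))), schema)
df.collect()                            // Array([36893488147419103232]) -- no silent null
{code}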



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2018-06-06 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503065#comment-16503065
 ] 

Liang-Chi Hsieh commented on SPARK-24357:
-

I think this is because this number {{1 << 65}} (36893488147419103232L) exceeds 
the range of Scala's Long type.

>>> TEST_DATA = [Row(data=9223372036854775807L)]
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=9223372036854775807)] 
>>> TEST_DATA = [Row(data=9223372036854775808L)]
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=None)]


> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> -
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Major
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return Null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult to debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24473) It is no need to clip the predictive value by maxValue and minValue when computing gradient on SVDplusplus model

2018-06-06 Thread caijianming (JIRA)
caijianming created SPARK-24473:
---

 Summary: It is no need to clip the predictive value by maxValue 
and minValue when computing gradient on SVDplusplus model
 Key: SPARK-24473
 URL: https://issues.apache.org/jira/browse/SPARK-24473
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.3.0
Reporter: caijianming


I think there is no need to clip the predicted value. Clipping changes the 
convex loss function into a non-convex one, which might hurt convergence.

Other well-known recommender systems and the original paper also do not include 
this step.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-06 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503002#comment-16503002
 ] 

Liang-Chi Hsieh commented on SPARK-24467:
-

[~josephkb] Does that mean {{VectorAssembler}} will change from a 
{{Transformer}} to a {{Model}}?

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.
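
A purely hypothetical sketch of how the two proposals above might look from the 
user side; the {{vectorSizes}} Param and the {{VectorAssemblerEstimator}} class 
do not exist today, and all names are illustrative only:

{code:scala}
// Proposal 1: an optional Param on the existing Transformer (hypothetical name `vectorSizes`).
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
  .setVectorSizes(Array(10, 5))             // hypothetical: skips metadata / row inspection

// Proposal 2: a hypothetical Estimator that learns the sizes from the data.
val est = new VectorAssemblerEstimator()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
val fitted: VectorAssembler = est.fit(df)   // a VectorAssembler with the sizes Param filled in
{code}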



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover

2018-06-06 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502929#comment-16502929
 ] 

yuhao yang commented on SPARK-15064:


Yuhao will be OOF from May 29th to June 6th (annual leave and conference). 
Please expect delayed email response. Contact 669 243 8273 for anything urgent.

Regards,
Yuhao


> Locale support in StopWordsRemover
> --
>
> Key: SPARK-15064
> URL: https://issues.apache.org/jira/browse/SPARK-15064
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We support case insensitive filtering (default) in StopWordsRemover. However, 
> case insensitive matching depends on the locale and region, which cannot be 
> explicitly set in StopWordsRemover. We should consider adding this support in 
> MLlib.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover

2018-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502928#comment-16502928
 ] 

Apache Spark commented on SPARK-15064:
--

User 'dongjinleekr' has created a pull request for this issue:
https://github.com/apache/spark/pull/21501

> Locale support in StopWordsRemover
> --
>
> Key: SPARK-15064
> URL: https://issues.apache.org/jira/browse/SPARK-15064
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We support case insensitive filtering (default) in StopWordsRemover. However, 
> case insensitive matching depends on the locale and region, which cannot be 
> explicitly set in StopWordsRemover. We should consider adding this support in 
> MLlib.
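
A minimal sketch of what locale-aware filtering could look like, assuming a 
{{locale}} Param along the lines of the linked pull request (the Param name and 
the expected value format are assumptions; everything else is the existing API):

{code:scala}
import java.util.Locale
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setCaseSensitive(false)                       // case-insensitive matching...
  .setLocale(new Locale("tr", "TR").toString)    // ...resolved under Turkish case rules (assumed Param)
{code}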



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException

2018-06-06 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502914#comment-16502914
 ] 

Shixiong Zhu commented on SPARK-24472:
--

cc [~dongjoon]

> Orc RecordReaderFactory throws IndexOutOfBoundsException
> 
>
> Key: SPARK-24472
> URL: https://issues.apache.org/jira/browse/SPARK-24472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> When the underlying file schema has more columns than the table schema, Orc 
> RecordReaderFactory throws IndexOutOfBoundsException. 
> "spark.sql.hive.convertMetastoreOrc" must be turned off so that 
> HiveTableScanExec is used. Here is a reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> Seq(("abc", 123, 123L)).toDF("s", "i", 
> "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest")
> spark.sql("""
> CREATE EXTERNAL TABLE orctest(s string)
> PARTITIONED BY (i int)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION '/tmp/orctest'
> """)
> spark.sql("msck repair table orctest")
> spark.sql("set spark.sql.hive.convertMetastoreOrc=false")
> // Exiting paste mode, now interpreting.
> 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.read.format("orc").load("/tmp/orctest").show()
> +---+---+---+
> |  s|  l|  i|
> +---+---+---+
> |abc|123|123|
> +---+---+---+
> scala> spark.sql("select * from orctest").show()
> 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.IndexOutOfBoundsException: toIndex = 2
>   at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
>   at java.util.ArrayList.subList(ArrayList.java:996)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
>   at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>

[jira] [Closed] (SPARK-24455) fix typo in TaskSchedulerImpl's comments

2018-06-06 Thread xueyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xueyu closed SPARK-24455.
-

> fix typo in TaskSchedulerImpl's comments
> 
>
> Key: SPARK-24455
> URL: https://issues.apache.org/jira/browse/SPARK-24455
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: xueyu
>Assignee: xueyu
>Priority: Trivial
> Fix For: 2.3.2, 2.4.0
>
>
> Fix the method name in TaskSchedulerImpl.scala's comments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org