[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()
[ https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504316#comment-16504316 ] Hyukjin Kwon commented on SPARK-24442: -- I meant: please avoid setting the "Fix Version/s" field when the JIRA is created. > Add configuration parameter to adjust the number of records and the characters > per row before truncation when a user runs .show() > --- > > Key: SPARK-24442 > URL: https://issues.apache.org/jira/browse/SPARK-24442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Andrew K Long >Priority: Minor > Attachments: spark-adjustable-display-size.diff > > Original Estimate: 12h > Remaining Estimate: 12h > > Currently the number of characters displayed when a user runs the .show() > function on a data frame is hard-coded. The current default is too small when > used with wider console widths. This fix will add two parameters. > > parameter: "spark.show.default.number.of.rows" default: "20" > parameter: "spark.show.default.truncate.characters.per.column" default: "20" > > This change will be backwards compatible and will not break any existing > functionality nor change the default display characteristics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
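A minimal Scala sketch of how the proposed defaults could be consumed, assuming the two configuration keys named in the ticket (they are only proposed here and are not recognized by released Spark versions), with a fallback to today's hard-coded value of 20 for both:

{code:scala}
import org.apache.spark.sql.SparkSession

object ShowDefaultsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("show-defaults-sketch").getOrCreate()

    // Both keys below come from this ticket's proposal; they are read here
    // only to illustrate the intended behaviour.
    val numRows = spark.conf
      .getOption("spark.show.default.number.of.rows")
      .map(_.toInt).getOrElse(20)
    val truncateAt = spark.conf
      .getOption("spark.show.default.truncate.characters.per.column")
      .map(_.toInt).getOrElse(20)

    // A wide column that would normally be cut at 20 characters.
    val df = spark.range(100).selectExpr("id", "repeat('x', 50) AS wide_col")

    // Today callers must pass both values explicitly on every call;
    // the patch would have show() pick them up from the session conf.
    df.show(numRows, truncateAt)

    spark.stop()
  }
}
{code}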
[jira] [Resolved] (SPARK-24279) Incompatible byte code errors, when using test-jar of spark sql.
[ https://issues.apache.org/jira/browse/SPARK-24279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-24279. - Resolution: Invalid > Incompatible byte code errors, when using test-jar of spark sql. > > > Key: SPARK-24279 > URL: https://issues.apache.org/jira/browse/SPARK-24279 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming, Tests >Affects Versions: 2.3.0, 2.4.0 >Reporter: Prashant Sharma >Priority: Major > > Using test libraries available with spark sql streaming, produces weird > incompatible byte code errors. It is already tested on a different virtual > box instance to make sure, that it is not related to my system environment. A > reproducer is uploaded to github. > [https://github.com/ScrapCodes/spark-bug-reproducer] > > Doing a clean build reproduces the error. > > Verbatim paste of the error. > {code:java} > [INFO] Compiling 1 source files to > /home/prashant/work/test/target/test-classes at 1526380360990 > [ERROR] error: missing or invalid dependency detected while loading class > file 'QueryTest.class'. > [INFO] Could not access type PlanTest in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'QueryTest.class' was compiled against an > incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] error: missing or invalid dependency detected while loading class > file 'SQLTestUtilsBase.class'. > [INFO] Could not access type PlanTestBase in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'SQLTestUtilsBase.class' was compiled > against an incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] error: missing or invalid dependency detected while loading class > file 'SQLTestUtils.class'. > [INFO] Could not access type PlanTest in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'SQLTestUtils.class' was compiled against > an incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:25: > error: Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will be added > in future releases. > [ERROR] val inputData = MemoryStream[Int] > [ERROR] ^ > [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:30: > error: Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will be added > in future releases. 
> [ERROR] CheckAnswer(2, 3, 4)) > [ERROR] ^ > [ERROR] 5 errors found > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
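The compiler errors above complain that QueryTest and SQLTestUtils reference PlanTest/PlanTestBase in org.apache.spark.sql.catalyst.plans, classes that ship in the spark-catalyst test-jar rather than the spark-sql one. A hedged sbt sketch of a build definition that puts the relevant test-jars on the test classpath; the exact dependency set the reproducer was missing is not stated in this thread, so treat this as an illustration rather than the confirmed fix:

{code:scala}
// build.sbt sketch: depend on the test-jars of spark-sql *and* its upstream
// modules, since QueryTest/SQLTestUtils extend classes packaged in the
// spark-catalyst and spark-core test-jars.
val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"      % sparkVersion,
  "org.apache.spark" %% "spark-core"     % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % sparkVersion % Test classifier "tests",
  "org.scalatest"    %% "scalatest"      % "3.0.5"      % Test
)
{code}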
[jira] [Assigned] (SPARK-24477) Import submodules under pyspark.ml by default
[ https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24477: Assignee: (was: Apache Spark) > Import submodules under pyspark.ml by default > - > > Key: SPARK-24477 > URL: https://issues.apache.org/jira/browse/SPARK-24477 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > Right now, we do not import submodules under pyspark.ml by default. So users > cannot do > {code} > from pyspark import ml > kmeans = ml.clustering.KMeans(...) > {code} > I create this JIRA to discuss if we should import the submodules by default. > It will change behavior of > {code} > from pyspark.ml import * > {code} > But it simplifies unnecessary imports. > cc [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24477) Import submodules under pyspark.ml by default
[ https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24477: Assignee: Apache Spark > Import submodules under pyspark.ml by default > - > > Key: SPARK-24477 > URL: https://issues.apache.org/jira/browse/SPARK-24477 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Major > > Right now, we do not import submodules under pyspark.ml by default. So users > cannot do > {code} > from pyspark import ml > kmeans = ml.clustering.KMeans(...) > {code} > I create this JIRA to discuss if we should import the submodules by default. > It will change behavior of > {code} > from pyspark.ml import * > {code} > But it simplifies unnecessary imports. > cc [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24477) Import submodules under pyspark.ml by default
[ https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504247#comment-16504247 ] Apache Spark commented on SPARK-24477: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21483 > Import submodules under pyspark.ml by default > - > > Key: SPARK-24477 > URL: https://issues.apache.org/jira/browse/SPARK-24477 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > Right now, we do not import submodules under pyspark.ml by default. So users > cannot do > {code} > from pyspark import ml > kmeans = ml.clustering.KMeans(...) > {code} > I create this JIRA to discuss if we should import the submodules by default. > It will change behavior of > {code} > from pyspark.ml import * > {code} > But it simplifies unnecessary imports. > cc [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24279) Incompatible byte code errors, when using test-jar of spark sql.
[ https://issues.apache.org/jira/browse/SPARK-24279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504246#comment-16504246 ] Prashant Sharma commented on SPARK-24279: - Thanks a lot, that was the mistake. > Incompatible byte code errors, when using test-jar of spark sql. > > > Key: SPARK-24279 > URL: https://issues.apache.org/jira/browse/SPARK-24279 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming, Tests >Affects Versions: 2.3.0, 2.4.0 >Reporter: Prashant Sharma >Priority: Major > > Using test libraries available with spark sql streaming, produces weird > incompatible byte code errors. It is already tested on a different virtual > box instance to make sure, that it is not related to my system environment. A > reproducer is uploaded to github. > [https://github.com/ScrapCodes/spark-bug-reproducer] > > Doing a clean build reproduces the error. > > Verbatim paste of the error. > {code:java} > [INFO] Compiling 1 source files to > /home/prashant/work/test/target/test-classes at 1526380360990 > [ERROR] error: missing or invalid dependency detected while loading class > file 'QueryTest.class'. > [INFO] Could not access type PlanTest in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'QueryTest.class' was compiled against an > incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] error: missing or invalid dependency detected while loading class > file 'SQLTestUtilsBase.class'. > [INFO] Could not access type PlanTestBase in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'SQLTestUtilsBase.class' was compiled > against an incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] error: missing or invalid dependency detected while loading class > file 'SQLTestUtils.class'. > [INFO] Could not access type PlanTest in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'SQLTestUtils.class' was compiled against > an incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:25: > error: Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will be added > in future releases. > [ERROR] val inputData = MemoryStream[Int] > [ERROR] ^ > [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:30: > error: Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will be added > in future releases. 
> [ERROR] CheckAnswer(2, 3, 4)) > [ERROR] ^ > [ERROR] 5 errors found > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24475) Nested JSON count() Exception
[ https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24475. -- Resolution: Duplicate > Nested JSON count() Exception > - > > Key: SPARK-24475 > URL: https://issues.apache.org/jira/browse/SPARK-24475 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Joseph Toth >Priority: Major > > I have nested structure json file only 2 rows. > > {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}} > {{jsonData = spark.read.json(file)}} > {{jsonData.count() will crash with the following exception, jsonData.head(10) > works.}}{{}} > > Traceback (most recent call last): > File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line > 2882, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > jsonData.count() > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line > 455, in count > return int(self._jdf.count()) > File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line > 1160, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o411.count. > : java.lang.IllegalArgumentException > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) > at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432) > at org.apache.xbean.asm5.ClassReader.a(Unknown Source) > at org.apache.xbean.asm5.ClassReader.b(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2292) > at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2066) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.collect(RDD.scala:938) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2769) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.b
[jira] [Commented] (SPARK-24475) Nested JSON count() Exception
[ https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504235#comment-16504235 ] Hyukjin Kwon commented on SPARK-24475: -- I don't think Spark currently support Java 9 and 10 yet. Let's leave this as a duplicate of SPARK-24417 > Nested JSON count() Exception > - > > Key: SPARK-24475 > URL: https://issues.apache.org/jira/browse/SPARK-24475 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Joseph Toth >Priority: Major > > I have nested structure json file only 2 rows. > > {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}} > {{jsonData = spark.read.json(file)}} > {{jsonData.count() will crash with the following exception, jsonData.head(10) > works.}}{{}} > > Traceback (most recent call last): > File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line > 2882, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > jsonData.count() > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line > 455, in count > return int(self._jdf.count()) > File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line > 1160, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o411.count. > : java.lang.IllegalArgumentException > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) > at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432) > at org.apache.xbean.asm5.ClassReader.a(Unknown Source) > at org.apache.xbean.asm5.ClassReader.b(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261) > at 
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2292) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.collect(RDD.scala:938) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2769) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Nati
[jira] [Comment Edited] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504213#comment-16504213 ] Liang-Chi Hsieh edited comment on SPARK-24447 at 6/7/18 4:09 AM: - I just built Spark from current 2.3 branch. The above example code also works well on it. was (Author: viirya): I just build Spark from current 2.3 branch. The above example code also works well. > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Major > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > This snippet should reproduce the error: > {code:java} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) > matrix = RowMatrix(rows) > sims = matrix.columnSimilarities() > ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) > print(sims.numRows(),sims.numCols()) > ## This throws an error (stack trace below) > print(sims.entries.first()) > ## Later I tried this > print(rows.context) # > print(sims.entries.context) # PySparkShell>, then throws an error{code} > Error stack trace > {code:java} > --- > AttributeError Traceback (most recent call last) > in () > > 1 sims.entries.first() > /usr/lib/spark/python/pyspark/rdd.py in first(self) > 1374 ValueError: RDD is empty > 1375 """ > -> 1376 rs = self.take(1) > 1377 if rs: > 1378 return rs[0] > /usr/lib/spark/python/pyspark/rdd.py in take(self, num) > 1356 > 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) > -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) > 1359 > 1360 items += res > /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, > partitions, allowLocal) > 999 # SparkContext#runJob. > 1000 mappedRDD = rdd.mapPartitions(partitionFunc) > -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) > 1003 > AttributeError: 'NoneType' object has no attribute 'sc' > {code} > PySpark columnSimilarities documentation > http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504213#comment-16504213 ] Liang-Chi Hsieh commented on SPARK-24447: - I just build Spark from current 2.3 branch. The above example code also works well. > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Major > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > This snippet should reproduce the error: > {code:java} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) > matrix = RowMatrix(rows) > sims = matrix.columnSimilarities() > ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) > print(sims.numRows(),sims.numCols()) > ## This throws an error (stack trace below) > print(sims.entries.first()) > ## Later I tried this > print(rows.context) # > print(sims.entries.context) # PySparkShell>, then throws an error{code} > Error stack trace > {code:java} > --- > AttributeError Traceback (most recent call last) > in () > > 1 sims.entries.first() > /usr/lib/spark/python/pyspark/rdd.py in first(self) > 1374 ValueError: RDD is empty > 1375 """ > -> 1376 rs = self.take(1) > 1377 if rs: > 1378 return rs[0] > /usr/lib/spark/python/pyspark/rdd.py in take(self, num) > 1356 > 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) > -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) > 1359 > 1360 items += res > /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, > partitions, allowLocal) > 999 # SparkContext#runJob. > 1000 mappedRDD = rdd.mapPartitions(partitionFunc) > -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) > 1003 > AttributeError: 'NoneType' object has no attribute 'sc' > {code} > PySpark columnSimilarities documentation > http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
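For comparison, a Scala-side equivalent of the reported snippet, runnable in spark-shell. This is offered only as a triage aid under the assumption that the JVM API returns a plain CoordinateMatrix whose entries RDD is bound to the active SparkContext; if this runs cleanly, the defect would sit in the Python wrapper rather than in MLlib itself:

{code:scala}
// Run inside spark-shell, where `sc` is the active SparkContext.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(0.0, 1.0, 2.0),
  Vectors.dense(1.0, 1.0, 1.0)))

val matrix = new RowMatrix(rows)
val sims = matrix.columnSimilarities()    // CoordinateMatrix of upper-triangular cosine similarities

println((sims.numRows(), sims.numCols())) // (3,3): 3 columns -> 3x3 similarity matrix
println(sims.entries.first())             // MatrixEntry(i, j, value); no lost context expected here
{code}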
[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504202#comment-16504202 ] Xinyong Tian commented on SPARK-24431: -- I also feel it is reasonable to set the first point to (0, p). In fact, as long as it is not (0, 1), the aucPR will be small enough for a model that predicts the same p for all examples, so cross-validation will not select such a model. > wrong areaUnderPR calculation in BinaryClassificationEvaluator > --- > > Key: SPARK-24431 > URL: https://issues.apache.org/jira/browse/SPARK-24431 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Xinyong Tian >Priority: Major > > My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., > evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to > select the best model. When the regParam in logistic regression is very high, no > variable will be selected (no model), i.e. every row's prediction is the same, e.g. > equal to the event rate (baseline frequency). But at this point, > BinaryClassificationEvaluator sets the areaUnderPR highest. As a result the best > model selected is a no-model. > The reason is the following: with no model, the precision-recall curve will have > only two points. At recall = 0, precision should be set to zero, while the > software sets it to 1; at recall = 1, precision is the event rate. As a result, > the areaUnderPR will be close to 0.5 (my event rate is very low), which is > the maximum. > The solution is to set precision = 0 when recall = 0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
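A small Scala illustration of the point above, with hypothetical numbers: for a model that assigns the same score to every example, the PR "curve" degenerates to its endpoints, so the area depends almost entirely on which precision value is attached to the recall-0 endpoint (1.0 today, versus 0.0 or the base rate p as proposed):

{code:scala}
// Trapezoidal area under a piecewise-linear PR curve given as (recall, precision) points.
def aucPR(points: Seq[(Double, Double)]): Double =
  points.sliding(2).collect { case Seq((r1, p1), (r2, p2)) =>
    (r2 - r1) * (p1 + p2) / 2.0
  }.sum

val p = 0.01 // hypothetical event rate = precision of an "always positive" model

// Degenerate model: one score for everything, so only the endpoints matter.
println(aucPR(Seq((0.0, 1.0), (1.0, p)))) // ~0.505: current convention, looks deceptively good
println(aucPR(Seq((0.0, p),   (1.0, p)))) // 0.01:  first point (0, p), reflects the useless model
println(aucPR(Seq((0.0, 0.0), (1.0, p)))) // 0.005: first point (0, 0), the strictest convention
{code}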
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504171#comment-16504171 ] Li Yuanjian commented on SPARK-24375: - Got it, great thanks for your detailed explanation. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24475) Nested JSON count() Exception
[ https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504163#comment-16504163 ] Joseph Toth commented on SPARK-24475: - It looks like it was my version of java on this machine. It was 9 jre. I upgraded to 10jdk, but received the error at the bottom. Then I downgraded to 8jdk and it worked. Thanks a bunch for getting back to me! openjdk version "1.8.0_171" OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-2-b11) OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode) ❯ java --version openjdk 10.0.1 2018-04-17 OpenJDK Runtime Environment (build 10.0.1+10-Debian-4) OpenJDK 64-Bit Server VM (build 10.0.1+10-Debian-4, mixed mode) ❯ spark-shell --master 'local[2]' WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 2018-06-06 22:08:16 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Failed to initialize compiler: object java.lang.Object in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programmatically, settings.usejavacp.value = true. Failed to initialize compiler: object java.lang.Object in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programmatically, settings.usejavacp.value = true. Exception in thread "main" java.lang.NullPointerException at scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256) at scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:896) at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:895) at scala.tools.nsc.interpreter.IMain$Request.headerPreamble$lzycompute(IMain.scala:895) at scala.tools.nsc.interpreter.IMain$Request.headerPreamble(IMain.scala:895) at scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:918) at scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1337) at scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1336) at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64) at scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1336) at scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:908) at scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:1002) at scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997) at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) > Nested JSON count() Exception > - > > Key: SPARK-24475 > URL: https://issues.apache.org/jira/browse/SPARK-24475 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Joseph Toth >Priority: Major > > I have nested structure json file only 2 rows. 
> > {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}} > {{jsonData = spark.read.json(file)}} > {{jsonData.count() will crash with the following exception, jsonData.head(10) > works.}}{{}} > > Traceback (most recent call last): > File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line > 2882, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > jsonData.count() > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line > 455, in count > return int(self._jdf.count()) > File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line > 1160, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o411.count. > : java.lang.IllegalArgumentException > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$.
[jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504152#comment-16504152 ] Liang-Chi Hsieh commented on SPARK-24447: - Yes, I can run the example code on a build from latest source code. Because I can't reproduce it with your provided example on latest codebase, I'd like to know if it is a problem in Spark codebase or not. Since you said the affect version is 2.3.0, I will try it on 2.3.0 codebase to see if there is a bug. > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Major > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > This snippet should reproduce the error: > {code:java} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) > matrix = RowMatrix(rows) > sims = matrix.columnSimilarities() > ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) > print(sims.numRows(),sims.numCols()) > ## This throws an error (stack trace below) > print(sims.entries.first()) > ## Later I tried this > print(rows.context) # > print(sims.entries.context) # PySparkShell>, then throws an error{code} > Error stack trace > {code:java} > --- > AttributeError Traceback (most recent call last) > in () > > 1 sims.entries.first() > /usr/lib/spark/python/pyspark/rdd.py in first(self) > 1374 ValueError: RDD is empty > 1375 """ > -> 1376 rs = self.take(1) > 1377 if rs: > 1378 return rs[0] > /usr/lib/spark/python/pyspark/rdd.py in take(self, num) > 1356 > 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) > -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) > 1359 > 1360 items += res > /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, > partitions, allowLocal) > 999 # SparkContext#runJob. > 1000 mappedRDD = rdd.mapPartitions(partitionFunc) > -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) > 1003 > AttributeError: 'NoneType' object has no attribute 'sc' > {code} > PySpark columnSimilarities documentation > http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24475) Nested JSON count() Exception
[ https://issues.apache.org/jira/browse/SPARK-24475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504134#comment-16504134 ] Hyukjin Kwon commented on SPARK-24475: -- I can't reproduce this. What's your version and env? does other JSON file work fine? > Nested JSON count() Exception > - > > Key: SPARK-24475 > URL: https://issues.apache.org/jira/browse/SPARK-24475 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Joseph Toth >Priority: Major > > I have nested structure json file only 2 rows. > > {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}} > {{jsonData = spark.read.json(file)}} > {{jsonData.count() will crash with the following exception, jsonData.head(10) > works.}}{{}} > > Traceback (most recent call last): > File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line > 2882, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > jsonData.count() > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line > 455, in count > return int(self._jdf.count()) > File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line > 1160, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o411.count. > : java.lang.IllegalArgumentException > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at org.apache.xbean.asm5.ClassReader.(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) > at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432) > at org.apache.xbean.asm5.ClassReader.a(Unknown Source) > at org.apache.xbean.asm5.ClassReader.b(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261) > at 
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2292) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > at org.apache.spark.rdd.RDD.collect(RDD.scala:938) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2769) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at
[jira] [Updated] (SPARK-24479) Register StreamingQueryListener in Spark Conf
[ https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24479: - Affects Version/s: 2.4.0 > Register StreamingQueryListener in Spark Conf > -- > > Key: SPARK-24479 > URL: https://issues.apache.org/jira/browse/SPARK-24479 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0, 2.4.0 >Reporter: Mingjie Tang >Priority: Major > Labels: feature > > Users need to register their own StreamingQueryListener into > StreamingQueryManager, the similar function is provided as EXTRA_LISTENERS > and QUERY_EXECUTION_LISTENERS. > We propose to provide STREAMING_QUERY_LISTENER Conf for user to register > their own listener. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24480) Add a config to register custom StreamingQueryListeners
[ https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504121#comment-16504121 ] Hyukjin Kwon commented on SPARK-24480: -- (let's avoid setting the fix version; it is usually set when the issue is actually fixed) > Add a config to register custom StreamingQueryListeners > --- > > Key: SPARK-24480 > URL: https://issues.apache.org/jira/browse/SPARK-24480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Minor > > Currently a "StreamingQueryListener" can only be registered programmatically. > We could have a new config "spark.sql.streamingQueryListeners" similar to > "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to > register custom streaming listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24480) Add a config to register custom StreamingQueryListeners
[ https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24480: - Fix Version/s: (was: 2.4.0) > Add a config to register custom StreamingQueryListeners > --- > > Key: SPARK-24480 > URL: https://issues.apache.org/jira/browse/SPARK-24480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Minor > > Currently a "StreamingQueryListener" can only be registered programatically. > We could have a new config "spark.sql.streamingQueryListeners" similar to > "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to > register custom streaming listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24479) Register StreamingQueryListener in Spark Conf
[ https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504120#comment-16504120 ] Arun Mahadevan commented on SPARK-24479: PR raised - https://github.com/apache/spark/pull/21504 > Register StreamingQueryListener in Spark Conf > -- > > Key: SPARK-24479 > URL: https://issues.apache.org/jira/browse/SPARK-24479 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Mingjie Tang >Priority: Major > Labels: feature > > Users need to register their own StreamingQueryListener into > StreamingQueryManager, the similar function is provided as EXTRA_LISTENERS > and QUERY_EXECUTION_LISTENERS. > We propose to provide STREAMING_QUERY_LISTENER Conf for user to register > their own listener. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24479) Register StreamingQueryListener in Spark Conf
[ https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24479: Assignee: Apache Spark > Register StreamingQueryListener in Spark Conf > -- > > Key: SPARK-24479 > URL: https://issues.apache.org/jira/browse/SPARK-24479 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Mingjie Tang >Assignee: Apache Spark >Priority: Major > Labels: feature > > Users need to register their own StreamingQueryListener into > StreamingQueryManager, the similar function is provided as EXTRA_LISTENERS > and QUERY_EXECUTION_LISTENERS. > We propose to provide STREAMING_QUERY_LISTENER Conf for user to register > their own listener. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24479) Register StreamingQueryListener in Spark Conf
[ https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504119#comment-16504119 ] Apache Spark commented on SPARK-24479: -- User 'arunmahadevan' has created a pull request for this issue: https://github.com/apache/spark/pull/21504 > Register StreamingQueryListener in Spark Conf > -- > > Key: SPARK-24479 > URL: https://issues.apache.org/jira/browse/SPARK-24479 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Mingjie Tang >Priority: Major > Labels: feature > > Users need to register their own StreamingQueryListener into > StreamingQueryManager, the similar function is provided as EXTRA_LISTENERS > and QUERY_EXECUTION_LISTENERS. > We propose to provide STREAMING_QUERY_LISTENER Conf for user to register > their own listener. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24479) Register StreamingQueryListener in Spark Conf
[ https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24479: Assignee: (was: Apache Spark) > Register StreamingQueryListener in Spark Conf > -- > > Key: SPARK-24479 > URL: https://issues.apache.org/jira/browse/SPARK-24479 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Mingjie Tang >Priority: Major > Labels: feature > > Users need to register their own StreamingQueryListener into > StreamingQueryManager, the similar function is provided as EXTRA_LISTENERS > and QUERY_EXECUTION_LISTENERS. > We propose to provide STREAMING_QUERY_LISTENER Conf for user to register > their own listener. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24480) Add a config to register custom StreamingQueryListeners
[ https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Mahadevan resolved SPARK-24480. Resolution: Duplicate The issue was already raised; this is a duplicate of SPARK-24479. > Add a config to register custom StreamingQueryListeners > --- > > Key: SPARK-24480 > URL: https://issues.apache.org/jira/browse/SPARK-24480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Minor > Fix For: 2.4.0 > > > Currently a "StreamingQueryListener" can only be registered programmatically. > We could have a new config "spark.sql.streamingQueryListeners" similar to > "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to > register custom streaming listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
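For context, a Scala sketch of what these two tickets would replace: today a listener must be attached programmatically through spark.streams.addListener, whereas the proposed spark.sql.streamingQueryListeners key would let the listener class be supplied via configuration. The config key is only proposed in these tickets and is not available in released versions; the listener class name below is made up for the example:

{code:scala}
// Paste into spark-shell or a driver program.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// A trivial listener that just logs query lifecycle events to stdout.
class LoggingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"progress: ${event.progress.numInputRows} input rows")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}")
}

val spark = SparkSession.builder().appName("listener-sketch").getOrCreate()

// Current, programmatic-only registration path.
spark.streams.addListener(new LoggingListener)

// With the proposed config, the same effect would come from something like
//   --conf spark.sql.streamingQueryListeners=com.example.LoggingListener
// (key name taken from the ticket; class name hypothetical).
{code}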
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Exception: {code:java} 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193) at 
com.google.common.cache.LocalCache.get(LocalCache.java:3932) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1392) at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$3.apply(SparkPlan.scala:167) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:164) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:61) at org.apache.
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Exception: {code:java} 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB{code} Log file is attached was: Similar to other "grows beyond 64 KB" errors. 
Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 and Databricks Cloud 4.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. Happens with large case > statement: > {code:java} > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://";, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_n
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Environment: Emr 5.13.0 and Databricks Cloud 4.0 (was: Emr 5.13.0) > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 and Databricks Cloud 4.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. Happens with large case > statement: > {code:java} > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://";, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") > .toDF("id","event_time","url","action","page_name","event_name","navigation_action") > var a = "case " > for(i <- 1 to 300){ > a = a + s"when action like '$i%' THEN '$i' " > } > a = a + " else null end as task_id" > val expression = expr(a) > df = df.filter("id is not null and id <> '' and event_time is not null") > val transformationExpressions: mutable.HashMap[String, Column] = > mutable.HashMap( > "action" -> expr("coalesce(action, navigation_action) as action"), > "task_id" -> expression > ) > for((col, expr) <- transformationExpressions) > df = df.withColumn(col, expr) > df = df.filter("(action is not null and action <> '') or (page_name is not > null and page_name <> '')") > df.show > {code} > Log file is attached -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
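For readers hitting the same "grows beyond 64 KB" failure, two commonly suggested mitigations are sketched below. These are assumptions about a workaround, not a fix for the codegen issue itself; both use standard Spark SQL configuration keys.
{code:java}
// Sketch of common mitigations for "grows beyond 64 KB" (assumed workaround, not from the report).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codegen-64kb-workaround").getOrCreate()

// 1. Let Spark fall back to the interpreted path when generated code fails to compile.
spark.conf.set("spark.sql.codegen.fallback", "true")

// 2. Or disable whole-stage codegen so the 300-branch CASE is not fused into one oversized method.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}
Splitting the 300-branch CASE into several smaller expressions, or joining against a small lookup table, is another way to keep each generated method under the JVM's 64 KB method-size limit.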
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached was: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. 
Happens with large case > statement: > {code:java} > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://";, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") > .toDF("id","event_time","url","action","page_name","event_name","navigation_action") > var a = "case " > for(i <- 1 to 300){ > a = a + s"when action like '$i%' THEN '$i' " > } > a = a + " else null end as task_id" > val expression = expr(a) > df = df.filter("id is not null and id <> '' and event_time is not null") > val transformationExpressions: mutable.HashMap[String, Column] = > mutable.HashMap( > "action" -> expr("coalesce(action, navigation_action) as action"), > "task_id" -> expression > ) > for((col, expr) <- transformationExpressions) > df = df.withColumn(col, expr) > df = df.filter("(action is not null and action <> '') or (page_name is not > null and page_name <> '')") > df.show > {code} > Log file is attached -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Description: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300){ a = a + s"when action like '$i%' THEN '$i' " } a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached was: Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300) a = a + s"when action like '$i%' THEN '$i' " a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. 
Happens with large case > statement: > {code:java} > // Databricks notebook source > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://";, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") > .toDF("id","event_time","url","action","page_name","event_name","navigation_action") > var a = "case " > for(i <- 1 to 300){ > a = a + s"when action like '$i%' THEN '$i' " > } > a = a + " else null end as task_id" > val expression = expr(a) > df = df.filter("id is not null and id <> '' and event_time is not null") > val transformationExpressions: mutable.HashMap[String, Column] = > mutable.HashMap( > "action" -> expr("coalesce(action, navigation_action) as action"), > "task_id" -> expression > ) > for((col, expr) <- transformationExpressions) > df = df.withColumn(col, expr) > df = df.filter("(action is not null and action <> '') or (page_name is not > null and page_name <> '')") > df.show > {code} > Log file is attached --
[jira] [Updated] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Conegliano updated SPARK-24481: -- Attachment: log4j-active(1).log > GeneratedIteratorForCodegenStage1 grows beyond 64 KB > > > Key: SPARK-24481 > URL: https://issues.apache.org/jira/browse/SPARK-24481 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Emr 5.13.0 >Reporter: Andrew Conegliano >Priority: Major > Attachments: log4j-active(1).log > > > Similar to other "grows beyond 64 KB" errors. Happens with large case > statement: > {code:java} > // Databricks notebook source > import org.apache.spark.sql.functions._ > import scala.collection.mutable > import org.apache.spark.sql.Column > var rdd = sc.parallelize(Array("""{ > "event": > { > "timestamp": 1521086591110, > "event_name": "yu", > "page": > { > "page_url": "https://";, > "page_name": "es" > }, > "properties": > { > "id": "87", > "action": "action", > "navigate_action": "navigate_action" > } > } > } > """)) > var df = spark.read.json(rdd) > df = > df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") > .toDF("id","event_time","url","action","page_name","event_name","navigation_action") > var a = "case " > for(i <- 1 to 300) > a = a + s"when action like '$i%' THEN '$i' " > a = a + " else null end as task_id" > val expression = expr(a) > df = df.filter("id is not null and id <> '' and event_time is not null") > val transformationExpressions: mutable.HashMap[String, Column] = > mutable.HashMap( > "action" -> expr("coalesce(action, navigation_action) as action"), > "task_id" -> expression > ) > for((col, expr) <- transformationExpressions) > df = df.withColumn(col, expr) > df = df.filter("(action is not null and action <> '') or (page_name is not > null and page_name <> '')") > df.show > {code} > Log file is attached -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24480) Add a config to register custom StreamingQueryListeners
[ https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24480: Assignee: (was: Apache Spark) > Add a config to register custom StreamingQueryListeners > --- > > Key: SPARK-24480 > URL: https://issues.apache.org/jira/browse/SPARK-24480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Minor > Fix For: 2.4.0 > > > Currently a "StreamingQueryListener" can only be registered programatically. > We could have a new config "spark.sql.streamingQueryListeners" similar to > "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to > register custom streaming listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24480) Add a config to register custom StreamingQueryListeners
[ https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504113#comment-16504113 ] Apache Spark commented on SPARK-24480: -- User 'arunmahadevan' has created a pull request for this issue: https://github.com/apache/spark/pull/21504 > Add a config to register custom StreamingQueryListeners > --- > > Key: SPARK-24480 > URL: https://issues.apache.org/jira/browse/SPARK-24480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Minor > Fix For: 2.4.0 > > > Currently a "StreamingQueryListener" can only be registered programatically. > We could have a new config "spark.sql.streamingQueryListeners" similar to > "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to > register custom streaming listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24480) Add a config to register custom StreamingQueryListeners
[ https://issues.apache.org/jira/browse/SPARK-24480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24480: Assignee: Apache Spark > Add a config to register custom StreamingQueryListeners > --- > > Key: SPARK-24480 > URL: https://issues.apache.org/jira/browse/SPARK-24480 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Assignee: Apache Spark >Priority: Minor > Fix For: 2.4.0 > > > Currently a "StreamingQueryListener" can only be registered programatically. > We could have a new config "spark.sql.streamingQueryListeners" similar to > "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to > register custom streaming listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB
Andrew Conegliano created SPARK-24481: - Summary: GeneratedIteratorForCodegenStage1 grows beyond 64 KB Key: SPARK-24481 URL: https://issues.apache.org/jira/browse/SPARK-24481 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: Emr 5.13.0 Reporter: Andrew Conegliano Attachments: log4j-active(1).log Similar to other "grows beyond 64 KB" errors. Happens with large case statement: {code:java} // Databricks notebook source import org.apache.spark.sql.functions._ import scala.collection.mutable import org.apache.spark.sql.Column var rdd = sc.parallelize(Array("""{ "event": { "timestamp": 1521086591110, "event_name": "yu", "page": { "page_url": "https://";, "page_name": "es" }, "properties": { "id": "87", "action": "action", "navigate_action": "navigate_action" } } } """)) var df = spark.read.json(rdd) df = df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action") .toDF("id","event_time","url","action","page_name","event_name","navigation_action") var a = "case " for(i <- 1 to 300) a = a + s"when action like '$i%' THEN '$i' " a = a + " else null end as task_id" val expression = expr(a) df = df.filter("id is not null and id <> '' and event_time is not null") val transformationExpressions: mutable.HashMap[String, Column] = mutable.HashMap( "action" -> expr("coalesce(action, navigation_action) as action"), "task_id" -> expression ) for((col, expr) <- transformationExpressions) df = df.withColumn(col, expr) df = df.filter("(action is not null and action <> '') or (page_name is not null and page_name <> '')") df.show {code} Log file is attached -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24480) Add a config to register custom StreamingQueryListeners
Arun Mahadevan created SPARK-24480: -- Summary: Add a config to register custom StreamingQueryListeners Key: SPARK-24480 URL: https://issues.apache.org/jira/browse/SPARK-24480 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Arun Mahadevan Fix For: 2.4.0 Currently a "StreamingQueryListener" can only be registered programmatically. We could have a new config "spark.sql.streamingQueryListeners" similar to "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to register custom streaming listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
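For context, this is the programmatic registration that the proposed "spark.sql.streamingQueryListeners" config would make declarative; the listener body below is purely illustrative.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("listener-demo").getOrCreate()

// Today a custom listener has to be attached in code, per SparkSession:
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query ${event.id} started")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"query ${event.progress.id}: ${event.progress.numInputRows} input rows")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query ${event.id} terminated")
})
{code}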
[jira] [Created] (SPARK-24479) Register StreamingQueryListener in Spark Conf
Mingjie Tang created SPARK-24479: Summary: Register StreamingQueryListener in Spark Conf Key: SPARK-24479 URL: https://issues.apache.org/jira/browse/SPARK-24479 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Mingjie Tang Users need to register their own StreamingQueryListener into the StreamingQueryManager; a similar facility already exists as EXTRA_LISTENERS and QUERY_EXECUTION_LISTENERS. We propose to provide a STREAMING_QUERY_LISTENER conf for users to register their own listeners. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion
[ https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24478: Assignee: (was: Apache Spark) > DataSourceV2 should push filters and projection at physical plan conversion > --- > > Key: SPARK-24478 > URL: https://issues.apache.org/jira/browse/SPARK-24478 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 currently pushes filters and projected columns in the optimized > plan, but this requires creating and configuring a reader multiple times and > prevents the v2 relation's output from being a fixed argument of the relation > case class. It is also much cleaner (see PR > [#21262|https://github.com/apache/spark/pull/21262]). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion
[ https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503874#comment-16503874 ] Apache Spark commented on SPARK-24478: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21503 > DataSourceV2 should push filters and projection at physical plan conversion > --- > > Key: SPARK-24478 > URL: https://issues.apache.org/jira/browse/SPARK-24478 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 currently pushes filters and projected columns in the optimized > plan, but this requires creating and configuring a reader multiple times and > prevents the v2 relation's output from being a fixed argument of the relation > case class. It is also much cleaner (see PR > [#21262|https://github.com/apache/spark/pull/21262]). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion
[ https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24478: Assignee: Apache Spark > DataSourceV2 should push filters and projection at physical plan conversion > --- > > Key: SPARK-24478 > URL: https://issues.apache.org/jira/browse/SPARK-24478 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > DataSourceV2 currently pushes filters and projected columns in the optimized > plan, but this requires creating and configuring a reader multiple times and > prevents the v2 relation's output from being a fixed argument of the relation > case class. It is also much cleaner (see PR > [#21262|https://github.com/apache/spark/pull/21262]). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them
[ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503835#comment-16503835 ] Joel Croteau commented on SPARK-24357: -- [~viirya], yes that's what I said. What I am saying is that when an integer that is too large to fit in a long occurs, the schema should either be inferred as a type that can hold it, or it should be a run-time error. > createDataFrame in Python infers large integers as long type and then fails > silently when converting them > - > > Key: SPARK-24357 > URL: https://issues.apache.org/jira/browse/SPARK-24357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Major > > When inferring the schema type of an RDD passed to createDataFrame, PySpark > SQL will infer any integral type as a LongType, which is a 64-bit integer, > without actually checking whether the values will fit into a 64-bit slot. If > the values are larger than 64 bits, then when pickled and unpickled in Java, > Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is > called, it will ignore the BigInteger type and return Null. This results in > any large integers in the resulting DataFrame being silently converted to > None. This can create some very surprising and difficult to debug behavior, > in particular if you are not aware of this limitation. There should either be > a runtime error at some point in this conversion chain, or else _infer_type > should infer larger integers as DecimalType with appropriate precision, or as > BinaryType. The former would be less convenient, but the latter may be > problematic to implement in practice. In any case, we should stop silently > converting large integers to None. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion
Ryan Blue created SPARK-24478: - Summary: DataSourceV2 should push filters and projection at physical plan conversion Key: SPARK-24478 URL: https://issues.apache.org/jira/browse/SPARK-24478 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue DataSourceV2 currently pushes filters and projected columns in the optimized plan, but this requires creating and configuring a reader multiple times and prevents the v2 relation's output from being a fixed argument of the relation case class. It is also much cleaner (see PR [#21262|https://github.com/apache/spark/pull/21262]). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24469) Support collations in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503723#comment-16503723 ] Alexander Shkapsky commented on SPARK-24469: *first* or *min* on a StringType column gets planned with SortAggregateExec due to the StringType aggregate result. So same performance problems. > Support collations in Spark SQL > --- > > Key: SPARK-24469 > URL: https://issues.apache.org/jira/browse/SPARK-24469 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Shkapsky >Priority: Major > > One of our use cases is to support case-insensitive comparison in operations, > including aggregation and text comparison filters. Another use case is to > sort via collator. Support for collations throughout the query processor > appear to be the proper way to support these needs. > Language-based worked arounds (for the aggregation case) are insufficient: > # SELECT UPPER(text)GROUP BY UPPER(text) > introduces invalid values into the output set > # SELECT MIN(text)...GROUP BY UPPER(text) > results in poor performance in our case, in part due to use of sort-based > aggregate > Examples of collation support in RDBMS: > * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html] > * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html] > * > [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html] > * [SQL > Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017] > * > [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24477) Import submodules under pyspark.ml by default
[ https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503710#comment-16503710 ] Hyukjin Kwon commented on SPARK-24477: -- Thanks for cc'ing me, [~mengxr]. Will do this too tomorrow. > Import submodules under pyspark.ml by default > - > > Key: SPARK-24477 > URL: https://issues.apache.org/jira/browse/SPARK-24477 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > Right now, we do not import submodules under pyspark.ml by default. So users > cannot do > {code} > from pyspark import ml > kmeans = ml.clustering.KMeans(...) > {code} > I create this JIRA to discuss if we should import the submodules by default. > It will change behavior of > {code} > from pyspark.ml import * > {code} > But it simplifies unnecessary imports. > cc [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24477) Import submodules under pyspark.ml by default
[ https://issues.apache.org/jira/browse/SPARK-24477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24477: -- Description: Right now, we do not import submodules under pyspark.ml by default. So users cannot do {code} from pyspark import ml kmeans = ml.clustering.KMeans(...) {code} I create this JIRA to discuss if we should import the submodules by default. It will change behavior of {code} from pyspark.ml import * {code} But it simplifies unnecessary imports. cc [~hyukjin.kwon] was: Right now, we do not import submodules under pyspark.ml by default. So users cannot do {code} from pyspark import ml kmeans = ml.clustering.KMeans(...) {code} I create this JIRA to discuss if we should import the submodules by default. It will change behavior of {code} from pyspark.ml import * {code} cc [~hyukjin.kwon] > Import submodules under pyspark.ml by default > - > > Key: SPARK-24477 > URL: https://issues.apache.org/jira/browse/SPARK-24477 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > Right now, we do not import submodules under pyspark.ml by default. So users > cannot do > {code} > from pyspark import ml > kmeans = ml.clustering.KMeans(...) > {code} > I create this JIRA to discuss if we should import the submodules by default. > It will change behavior of > {code} > from pyspark.ml import * > {code} > But it simplifies unnecessary imports. > cc [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined
[ https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24454: -- Issue Type: Improvement (was: Bug) > ml.image doesn't have __all__ explicitly defined > > > Key: SPARK-24454 > URL: https://issues.apache.org/jira/browse/SPARK-24454 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Priority: Minor > > ml/image.py doesn't have __all__ explicitly defined. It will import all > global names by default (only ImageSchema for now), which is not a good > practice. We should add __all__ to image.py. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24454) ml.image doesn't have __all__ explicitly defined
[ https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503700#comment-16503700 ] Xiangrui Meng commented on SPARK-24454: --- Updated this JIRA and created SPARK-24477. > ml.image doesn't have __all__ explicitly defined > > > Key: SPARK-24454 > URL: https://issues.apache.org/jira/browse/SPARK-24454 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Priority: Major > > ml/image.py doesn't have __all__ explicitly defined. It will import all > global names by default (only ImageSchema for now), which is not a good > practice. We should add __all__ to image.py. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined
[ https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24454: -- Priority: Minor (was: Major) > ml.image doesn't have __all__ explicitly defined > > > Key: SPARK-24454 > URL: https://issues.apache.org/jira/browse/SPARK-24454 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Priority: Minor > > ml/image.py doesn't have __all__ explicitly defined. It will import all > global names by default (only ImageSchema for now), which is not a good > practice. We should add __all__ to image.py. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24477) Import submodules under pyspark.ml by default
Xiangrui Meng created SPARK-24477: - Summary: Import submodules under pyspark.ml by default Key: SPARK-24477 URL: https://issues.apache.org/jira/browse/SPARK-24477 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Xiangrui Meng Right now, we do not import submodules under pyspark.ml by default. So users cannot do {code} from pyspark import ml kmeans = ml.clustering.KMeans(...) {code} I create this JIRA to discuss if we should import the submodules by default. It will change behavior of {code} from pyspark.ml import * {code} cc [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined
[ https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24454: -- Description: ml/image.py doesn't have __all__ explicitly defined. It will import all global names by default (only ImageSchema for now), which is not a good practice. We should add __all__ to image.py. (was: {code:java} from pyspark import ml ml.image{code} The ml.image line will fail because we don't have __all__ defined there. ) > ml.image doesn't have __all__ explicitly defined > > > Key: SPARK-24454 > URL: https://issues.apache.org/jira/browse/SPARK-24454 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Priority: Major > > ml/image.py doesn't have __all__ explicitly defined. It will import all > global names by default (only ImageSchema for now), which is not a good > practice. We should add __all__ to image.py. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24454) ml.image doesn't have __all__ explicitly defined
[ https://issues.apache.org/jira/browse/SPARK-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24454: -- Summary: ml.image doesn't have __all__ explicitly defined (was: ml.image doesn't have default imports) > ml.image doesn't have __all__ explicitly defined > > > Key: SPARK-24454 > URL: https://issues.apache.org/jira/browse/SPARK-24454 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Xiangrui Meng >Priority: Major > > {code:java} > from pyspark import ml > ml.image{code} > The ml.image line will fail because we don't have __all__ defined there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24469) Support collations in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503687#comment-16503687 ] Eric Maynard commented on SPARK-24469: -- Ah, I see, I was wrongly thinking of the second case where you use e.g. MIN to get some legitimate input value. But I can see how *min* would yield bad performance. Maybe try *first* instead? > Support collations in Spark SQL > --- > > Key: SPARK-24469 > URL: https://issues.apache.org/jira/browse/SPARK-24469 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Shkapsky >Priority: Major > > One of our use cases is to support case-insensitive comparison in operations, > including aggregation and text comparison filters. Another use case is to > sort via collator. Support for collations throughout the query processor > appear to be the proper way to support these needs. > Language-based worked arounds (for the aggregation case) are insufficient: > # SELECT UPPER(text)GROUP BY UPPER(text) > introduces invalid values into the output set > # SELECT MIN(text)...GROUP BY UPPER(text) > results in poor performance in our case, in part due to use of sort-based > aggregate > Examples of collation support in RDBMS: > * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html] > * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html] > * > [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html] > * [SQL > Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017] > * > [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
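A compact sketch of the workaround pattern being debated in this thread, on illustrative data; whether the aggregate is planned hash-based or sort-based is exactly the performance concern raised above.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("collation-workaround").getOrCreate()
import spark.implicits._

// Illustrative data: two spellings that should land in the same case-insensitive group.
val df = Seq("George Washington", "george washington").toDF("text")

// Group case-insensitively but keep one original-cased representative per group,
// instead of returning the UPPER()-ed key that never appears in the input.
val grouped = df
  .groupBy(upper($"text").as("text_key"))
  .agg(first($"text").as("text"), count(lit(1)).as("cnt"))

grouped.show()
{code}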
[jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503678#comment-16503678 ] Perry Chu commented on SPARK-24447: --- Do you mean building from source code? I haven't done that before, but I can try. If you can suggest a good guide to follow, I would appreciate it. > Pyspark RowMatrix.columnSimilarities() loses spark context > -- > > Key: SPARK-24447 > URL: https://issues.apache.org/jira/browse/SPARK-24447 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Perry Chu >Priority: Major > > The RDD behind the CoordinateMatrix returned by > RowMatrix.columnSimilarities() appears to be losing track of the spark > context. > I'm pretty new to spark - not sure if the problem is on the python side or > the scala side - would appreciate someone more experienced taking a look. > This snippet should reproduce the error: > {code:java} > from pyspark.mllib.linalg.distributed import RowMatrix > rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]]) > matrix = RowMatrix(rows) > sims = matrix.columnSimilarities() > ## This works, prints "3 3" as expected (3 columns = 3x3 matrix) > print(sims.numRows(),sims.numCols()) > ## This throws an error (stack trace below) > print(sims.entries.first()) > ## Later I tried this > print(rows.context) # > print(sims.entries.context) # PySparkShell>, then throws an error{code} > Error stack trace > {code:java} > --- > AttributeError Traceback (most recent call last) > in () > > 1 sims.entries.first() > /usr/lib/spark/python/pyspark/rdd.py in first(self) > 1374 ValueError: RDD is empty > 1375 """ > -> 1376 rs = self.take(1) > 1377 if rs: > 1378 return rs[0] > /usr/lib/spark/python/pyspark/rdd.py in take(self, num) > 1356 > 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts)) > -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p) > 1359 > 1360 items += res > /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, > partitions, allowLocal) > 999 # SparkContext#runJob. > 1000 mappedRDD = rdd.mapPartitions(partitionFunc) > -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) > 1003 > AttributeError: 'NoneType' object has no attribute 'sc' > {code} > PySpark columnSimilarities documentation > http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility
[ https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-24466: - Target Version/s: 2.4.0 > TextSocketMicroBatchReader no longer works with nc utility > -- > > Key: SPARK-24466 > URL: https://issues.apache.org/jira/browse/SPARK-24466 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > While playing with Spark 2.4.0-SNAPSHOT, I found nc command exits before > reading actual data so the query also exits with error. > > The reason is due to launching temporary reader for reading schema, and > closing reader, and re-opening reader. While reliable socket server should be > able to handle this without any issue, nc command normally can't handle > multiple connections and simply exits when closing temporary reader. > > Given that socket source is expected to be used from examples on official > document or some experiments, which we tend to simply use netcat, this is > better to be treated as bug, though this is a kind of limitation on netcat. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
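For reference, the quick-start pattern being discussed is roughly the following (host, port, and session name are arbitrary); with 2.4.0-SNAPSHOT, the temporary reader opened for schema inference causes a plain `nc -lk 9999` to exit before the query starts reading.
{code:java}
// Run `nc -lk 9999` in another terminal first, then start this query.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("socket-source-demo").getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val query = lines.writeStream
  .format("console")
  .start()

query.awaitTermination()
{code}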
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503647#comment-16503647 ] Jiang Xingbo commented on SPARK-24375: -- The major problem is that tasks in the same stage of an MPI workload may rely on intermediate results from their peer tasks to compute the final result; thus, when one task fails, the other tasks in the same stage may produce incorrect results or even hang, so it seems straightforward to simply retry the whole stage on task failure. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24279) Incompatible byte code errors, when using test-jar of spark sql.
[ https://issues.apache.org/jira/browse/SPARK-24279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503630#comment-16503630 ] Shixiong Zhu commented on SPARK-24279: -- Just did a quick look at your pom.xml. I think it's missing the following catalyst test jar. {code} org.apache.spark spark-catalyst_${scala.binary.version} ${project.version} test-jar test {code} > Incompatible byte code errors, when using test-jar of spark sql. > > > Key: SPARK-24279 > URL: https://issues.apache.org/jira/browse/SPARK-24279 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming, Tests >Affects Versions: 2.3.0, 2.4.0 >Reporter: Prashant Sharma >Priority: Major > > Using test libraries available with spark sql streaming, produces weird > incompatible byte code errors. It is already tested on a different virtual > box instance to make sure, that it is not related to my system environment. A > reproducer is uploaded to github. > [https://github.com/ScrapCodes/spark-bug-reproducer] > > Doing a clean build reproduces the error. > > Verbatim paste of the error. > {code:java} > [INFO] Compiling 1 source files to > /home/prashant/work/test/target/test-classes at 1526380360990 > [ERROR] error: missing or invalid dependency detected while loading class > file 'QueryTest.class'. > [INFO] Could not access type PlanTest in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'QueryTest.class' was compiled against an > incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] error: missing or invalid dependency detected while loading class > file 'SQLTestUtilsBase.class'. > [INFO] Could not access type PlanTestBase in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'SQLTestUtilsBase.class' was compiled > against an incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] error: missing or invalid dependency detected while loading class > file 'SQLTestUtils.class'. > [INFO] Could not access type PlanTest in package > org.apache.spark.sql.catalyst.plans, > [INFO] because it (or its dependencies) are missing. Check your build > definition for > [INFO] missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to > see the problematic classpath.) > [INFO] A full rebuild may help if 'SQLTestUtils.class' was compiled against > an incompatible version of org.apache.spark.sql.catalyst.plans. > [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:25: > error: Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will be added > in future releases. > [ERROR] val inputData = MemoryStream[Int] > [ERROR] ^ > [ERROR] /home/prashant/work/test/src/test/scala/SparkStreamingTests.scala:30: > error: Unable to find encoder for type stored in a Dataset. 
Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will be added > in future releases. > [ERROR] CheckAnswer(2, 3, 4)) > [ERROR] ^ > [ERROR] 5 errors found > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
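The {code} block in the comment above appears to have lost its XML tags in the mail archive. Assuming standard Maven coordinates, the suggested catalyst test-jar dependency would look like this (a reconstruction, not a verbatim quote of the original comment):
{code}
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-catalyst_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}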
[jira] [Commented] (SPARK-24469) Support collations in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503621#comment-16503621 ] Alexander Shkapsky commented on SPARK-24469: A simple case with a single input value "George Washington" - UPPER("George Washington") will produce a group with "GEORGE WASHINGTON". "GEORGE WASHINGTON" is not present in the input. > Support collations in Spark SQL > --- > > Key: SPARK-24469 > URL: https://issues.apache.org/jira/browse/SPARK-24469 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Shkapsky >Priority: Major > > One of our use cases is to support case-insensitive comparison in operations, > including aggregation and text comparison filters. Another use case is to > sort via collator. Support for collations throughout the query processor > appear to be the proper way to support these needs. > Language-based worked arounds (for the aggregation case) are insufficient: > # SELECT UPPER(text)GROUP BY UPPER(text) > introduces invalid values into the output set > # SELECT MIN(text)...GROUP BY UPPER(text) > results in poor performance in our case, in part due to use of sort-based > aggregate > Examples of collation support in RDBMS: > * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html] > * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html] > * > [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html] > * [SQL > Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017] > * > [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24469) Support collations in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-24469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503586#comment-16503586 ] Eric Maynard commented on SPARK-24469: -- bq. SELECT UPPER(text)GROUP BY UPPER(text) bq. introduces invalid values into the output set Can you elaborate on this? > Support collations in Spark SQL > --- > > Key: SPARK-24469 > URL: https://issues.apache.org/jira/browse/SPARK-24469 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Shkapsky >Priority: Major > > One of our use cases is to support case-insensitive comparison in operations, > including aggregation and text comparison filters. Another use case is to > sort via collator. Support for collations throughout the query processor > appear to be the proper way to support these needs. > Language-based worked arounds (for the aggregation case) are insufficient: > # SELECT UPPER(text)GROUP BY UPPER(text) > introduces invalid values into the output set > # SELECT MIN(text)...GROUP BY UPPER(text) > results in poor performance in our case, in part due to use of sort-based > aggregate > Examples of collation support in RDBMS: > * [PostgreSQL|https://www.postgresql.org/docs/10/static/collation.html] > * [MySQL|https://dev.mysql.com/doc/refman/8.0/en/charset.html] > * > [Oracle|https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/linguistic-sorting-and-matching.html] > * [SQL > Server|https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017] > * > [DB2|https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.nls.doc/com.ibm.db2.luw.admin.nls.doc-gentopic2.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla updated SPARK-24476: Description: We are working on spark streaming application using spark structured streaming with checkpointing in s3. When we start the application, the application runs just fine for sometime then it crashes with the error mentioned below. The amount of time it will run successfully varies from time to time, sometimes it will run for 2 days without any issues then crashes, sometimes it will crash after 4hrs/ 24hrs. Our streaming application joins(left and inner) multiple sources from kafka and also s3 and aurora database. Can you please let us know how to solve this problem? Is it possible to somehow tweak the SocketTimeout-Time? Here, I'm pasting the few line of complete exception log below. Also attached the complete exception to the issue. *_Exception:_* *_Caused by: java.net.SocketTimeoutException: Read timed out_* _at java.net.SocketInputStream.socketRead0(Native Method)_ _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ _at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ was: We are working on spark streaming application using spark structured streaming with checkpointing in s3. When we start the application, the application runs just fine for sometime then it crashes with the error mentioned below. The amount of time it will run successfully varies from time to time, sometimes it will run for 2 days without any issues then crashes, sometimes it will crash after 4hrs/ 24hrs. Our streaming application joins(left and inner) multiple sources from kafka and also s3 and aurora database. Can you please let us know how to solve this problem? Is it possible to increase the timeout period? Here, I'm pasting the few line of complete exception log below. Also attached the complete exception to the issue. 
*_Exception:_* *_Caused by: java.net.SocketTimeoutException: Read timed out_* _at java.net.SocketInputStream.socketRead0(Native Method)_ _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ _at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ > java.net.SocketTimeoutException: Read timed out Exception while running the > Spark Structured Streaming in 2.3.0 > --- > > Key: SPARK-24476 > URL: https://issues.apache.org/jira/browse/SPARK-24476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: bharath kumar avusherla >Priority: Major > Attachments: socket-timeout-exception > > > We are working on spark streaming application using spark structured > streaming with checkpointing in s3. When we start the application, the > application runs just fine for sometime then it crashes with the error > mentioned below. The amount of time it will run successfully varies from time > to time, sometimes it will run for 2 days without any issues then crashes, > sometimes it will crash after 4hrs/ 24hrs. > Our streaming application joins(left and inner) multiple sources from kafka > and also s3 and aurora database. > Can you please let us know how to solve this problem? > Is it possible to somehow tweak the SocketTimeout-Time? > Here, I'm pasting the few line of
[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla updated SPARK-24476: Description: We are working on spark streaming application using spark structured streaming with checkpointing in s3. When we start the application, the application runs just fine for sometime then it crashes with the error mentioned below. The amount of time it will run successfully varies from time to time, sometimes it will run for 2 days without any issues then crashes, sometimes it will crash after 4hrs/ 24hrs. Our streaming application joins(left and inner) multiple sources from kafka and also s3 and aurora database. Can you please let us know how to solve this problem? Is it possible to increase the timeout period? Here, I'm pasting the few line of complete exception log below. Also attached the complete exception to the issue. *_Exception:_* *_Caused by: java.net.SocketTimeoutException: Read timed out_* _at java.net.SocketInputStream.socketRead0(Native Method)_ _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ _at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ was: We are working on spark streaming application using spark structured streaming with checkpointing in s3. When we start the application, the application runs just fine for sometime then it crashes with the error mentioned below. The amount of time it will run successfully varies from time to time, sometimes it will run for 2 days without any issues then crashes, sometimes it will crash after 4hrs/ 24hrs. Our streaming application joins(left and inner) multiple sources from kafka and also s3 and aurora database. Can you please let us know how to solve this problem? Is it possible to increase the timeout period? Here, I'm pasting the complete exception log below. 
*_Exception:_* *_Caused by: java.net.SocketTimeoutException: Read timed out_* _at java.net.SocketInputStream.socketRead0(Native Method)_ _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ _at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ _at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:179)_ _at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)_ _at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134)_ _at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:612)_ _at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:447)_ _at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)_ _at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)_ _at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)_ _at org.jets3t.service.im
[jira] [Updated] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla updated SPARK-24476: Attachment: socket-timeout-exception > java.net.SocketTimeoutException: Read timed out Exception while running the > Spark Structured Streaming in 2.3.0 > --- > > Key: SPARK-24476 > URL: https://issues.apache.org/jira/browse/SPARK-24476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: bharath kumar avusherla >Priority: Major > Attachments: socket-timeout-exception > > > We are working on spark streaming application using spark structured > streaming with checkpointing in s3. When we start the application, the > application runs just fine for sometime then it crashes with the error > mentioned below. The amount of time it will run successfully varies from time > to time, sometimes it will run for 2 days without any issues then crashes, > sometimes it will crash after 4hrs/ 24hrs. > Our streaming application joins(left and inner) multiple sources from kafka > and also s3 and aurora database. > Can you please let us know how to solve this problem? Is it possible to > increase the timeout period? > Here, I'm pasting the complete exception log below. > *_Exception:_* > *_Caused by: java.net.SocketTimeoutException: Read timed out_* > _at java.net.SocketInputStream.socketRead0(Native Method)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ > _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ > _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ > _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ > _at > sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ > _at > org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:179)_ > _at > org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)_ > _at > org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134)_ > _at > org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:612)_ > _at > org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:447)_ > _at > org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)_ > _at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)_ > _at > org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)_ > _at > org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)_ > _at > org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)_ > _at > org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)_ > _at > org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)_ > _at > 
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)_ > _at > org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)_ > _at > org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)_ > _at > org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)_ > _at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)_ > _at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)_ > _at java.lang.reflect.Method.invoke(Method.java:483)_ > _at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)_ > _at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)_ > _at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)_ > _at > org.apache.hadoop.io.retry.Retr
[jira] [Created] (SPARK-24476) java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0
bharath kumar avusherla created SPARK-24476: --- Summary: java.net.SocketTimeoutException: Read timed out Exception while running the Spark Structured Streaming in 2.3.0 Key: SPARK-24476 URL: https://issues.apache.org/jira/browse/SPARK-24476 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: bharath kumar avusherla We are working on spark streaming application using spark structured streaming with checkpointing in s3. When we start the application, the application runs just fine for sometime then it crashes with the error mentioned below. The amount of time it will run successfully varies from time to time, sometimes it will run for 2 days without any issues then crashes, sometimes it will crash after 4hrs/ 24hrs. Our streaming application joins(left and inner) multiple sources from kafka and also s3 and aurora database. Can you please let us know how to solve this problem? Is it possible to increase the timeout period? Here, I'm pasting the complete exception log below. *_Exception:_* *_Caused by: java.net.SocketTimeoutException: Read timed out_* _at java.net.SocketInputStream.socketRead0(Native Method)_ _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ _at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ _at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ _at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ _at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:179)_ _at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)_ _at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:134)_ _at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:612)_ _at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:447)_ _at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)_ _at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)_ _at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)_ _at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)_ _at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)_ _at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)_ _at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)_ _at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown 
Source)_ _at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)_ _at java.lang.reflect.Method.invoke(Method.java:483)_ _at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)_ _at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)_ _at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)_ _at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)_ _at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)_ _at org.apache.hadoop.fs.s3native.$Proxy18.retrieveMetadata(Unknown Source)_ _at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:493)_ _at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)_ _at org
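The stack trace above goes through the old s3n/JetS3t client, whose socket timeout is not exposed through Spark configuration. Below is a sketch of one commonly tried mitigation, assuming the hadoop-aws S3A connector is available on the cluster; the bucket, topic and path names are placeholders, and the property names are the stock Hadoop S3A keys and should be checked against the Hadoop version actually deployed:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-timeout-sketch").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// S3A client timeouts and retries (timeouts in milliseconds).
hadoopConf.set("fs.s3a.connection.timeout", "200000")          // socket read timeout
hadoopConf.set("fs.s3a.connection.establish.timeout", "50000") // TCP connect timeout
hadoopConf.set("fs.s3a.attempts.maximum", "20")                // retries inside the S3 client

// Checkpoint and output on s3a:// rather than s3n:// so the settings above apply.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .load()
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/app") // placeholder
  .option("path", "s3a://my-bucket/output/app")                    // placeholder
  .start()

query.awaitTermination()
{code}
Whether this removes the read timeouts entirely is not established by the ticket; it only moves the S3 traffic onto a client whose timeouts and retry counts can be tuned.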
[jira] [Created] (SPARK-24475) Nested JSON count() Exception
Joseph Toth created SPARK-24475: --- Summary: Nested JSON count() Exception Key: SPARK-24475 URL: https://issues.apache.org/jira/browse/SPARK-24475 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Joseph Toth I have nested structure json file only 2 rows. {{spark = SparkSession.builder.appName("JSONRead").getOrCreate()}} {{jsonData = spark.read.json(file)}} {{jsonData.count() will crash with the following exception, jsonData.head(10) works.}}{{}} Traceback (most recent call last): File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code exec(code_obj, self.user_global_ns, self.user_ns) File "", line 1, in jsonData.count() File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py", line 455, in count return int(self._jdf.count()) File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1160, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 320, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o411.count. : java.lang.IllegalArgumentException at org.apache.xbean.asm5.ClassReader.(Unknown Source) at org.apache.xbean.asm5.ClassReader.(Unknown Source) at org.apache.xbean.asm5.ClassReader.(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449) at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432) at org.apache.xbean.asm5.ClassReader.a(Unknown Source) at org.apache.xbean.asm5.ClassReader.b(Unknown Source) at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262) at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2292) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.collect(RDD.scala:938) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2770) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2769) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at org.apache.spark.sql.Dataset.count(Dataset.scala:2769) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:564) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.in
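For anyone trying the same steps outside PySpark, a minimal Scala transcription of the reported reproducer (the file path is a placeholder); whether the same IllegalArgumentException appears from Scala is not established by the report:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JSONRead").getOrCreate()
val jsonData = spark.read.json("/path/to/nested.json") // placeholder path

jsonData.head(10).foreach(println) // the equivalent of head(10), reported to work
println(jsonData.count())          // count() is what crashes in the PySpark report
{code}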
[jira] [Comment Edited] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503486#comment-16503486 ] Li Yuanjian edited comment on SPARK-24375 at 6/6/18 3:55 PM: - Hi [~cloud_fan] and [~jiangxb1987], just a tiny question here: I notice the discussion in SPARK-20928 mentioned the barrier scheduling which Continuous Processing will depend on. {quote} A barrier stage doesn’t launch any of its tasks until the available slots(free CPU cores can be used to launch pending tasks) satisfies the target to launch all the tasks at the same time, and always retry the whole stage when any task(s) fail. {quote} Why is task-level retry forbidden here, and is it possible to achieve it? Thanks. was (Author: xuanyuan): Hi [~cloud_fan] and [~jiangxb1987], just a tiny question here: I notice the discussion in SPARK-20928 mentioned the barrier scheduling. {quote} A barrier stage doesn’t launch any of its tasks until the available slots(free CPU cores can be used to launch pending tasks) satisfies the target to launch all the tasks at the same time, and always retry the whole stage when any task(s) fail. {quote} Why is task-level retry forbidden here, and is it possible to achieve it? Thanks. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24375) Design sketch: support barrier scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503486#comment-16503486 ] Li Yuanjian commented on SPARK-24375: - Hi [~cloud_fan] and [~jiangxb1987], just a tiny question here: I notice the discussion in SPARK-20928 mentioned the barrier scheduling. {quote} A barrier stage doesn’t launch any of its tasks until the available slots(free CPU cores can be used to launch pending tasks) satisfies the target to launch all the tasks at the same time, and always retry the whole stage when any task(s) fail. {quote} Why is task-level retry forbidden here, and is it possible to achieve it? Thanks. > Design sketch: support barrier scheduling in Apache Spark > - > > Key: SPARK-24375 > URL: https://issues.apache.org/jira/browse/SPARK-24375 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > This task is to outline a design sketch for the barrier scheduling SPIP > discussion. It doesn't need to be a complete design before the vote. But it > should at least cover both Scala/Java and PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503466#comment-16503466 ] Sean Owen commented on SPARK-24431: --- So, the model makes the same prediction p for all examples? In that case, for points on the PR curve corresponding to thresholds less than p, everything is classified positive. Recall is 1 and precision is the fraction of examples that are actually positive. For thresholds greater than p, all is marked negative. Recall is 0, but, really precision is undefined (0/0). Yes, see SPARK-21806 for a discussion of the different ways you could try to handle this – no point for recall = 0, or precision = 0, 1, or p. We (mostly I) settled on the latter as the least surprising change and one most likely to produce model comparisons that make intuitive sense. My argument against precision = 0 at recall = 0 is that it doesn't quite make sense that precision drops as recall decreases, and that would define precision as the smallest possible value. You're right that this is a corner case, and there is no really correct way to handle it. I'd say the issue here is perhaps leaning too much on the area under the PR curve. It doesn't have as much meaning as the area under the ROC curve. I think it's just the extreme example of the problem with comparing precision and recall? > wrong areaUnderPR calculation in BinaryClassificationEvaluator > --- > > Key: SPARK-24431 > URL: https://issues.apache.org/jira/browse/SPARK-24431 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Xinyong Tian >Priority: Major > > My problem, I am using CrossValidator(estimator=LogisticRegression(...), ..., > evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to > select best model. when the regParam in logistict regression is very high, no > variable will be selected (no model), ie every row 's prediction is same ,eg. > equal event rate (baseline frequency). But at this point, > BinaryClassificationEvaluator set the areaUnderPR highest. As a result best > model seleted is a no model. > the reason is following. at time of no model, precision recall curve will be > only two points: at recall =0, precision should be set to zero , while the > software set it to 1. at recall=1, precision is the event rate. As a result, > the areaUnderPR will be close 0.5 (my even rate is very low), which is > maximum . > the solution is to set precision =0 when recall =0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
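To make the corner case concrete, a small worked calculation (not from the ticket): for a model that gives every example the same score, with a fraction p of the examples actually positive, the PR curve has only the two points discussed above, and the trapezoidal area between recall 0 and recall 1 depends entirely on the convention chosen for the undefined precision at recall 0:
{noformat}
\[
\mathrm{AUPRC} =
\begin{cases}
\tfrac{1}{2}(1 + p) \approx 0.5 & \text{precision at recall} = 0 \text{ taken as } 1,\\
\tfrac{1}{2}\,p & \text{taken as } 0,\\
p & \text{taken as } p.
\end{cases}
\]
{noformat}
For a rare positive class (small p) the first convention scores the degenerate model near 0.5, which is the surprising result reported here; the last convention scores it at p, in line with the behaviour settled on in SPARK-21806.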
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503445#comment-16503445 ] Dongjoon Hyun commented on SPARK-24472: --- Thank you for pinging me, [~zsxwing] > Orc RecordReaderFactory throws IndexOutOfBoundsException > > > Key: SPARK-24472 > URL: https://issues.apache.org/jira/browse/SPARK-24472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Priority: Major > > When the column number of the underlying file schema is greater than the > column number of the table schema, Orc RecordReaderFactory will throw > IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be > turned off to use HiveTableScanExec. Here is a reproducer: > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > Seq(("abc", 123, 123L)).toDF("s", "i", > "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest") > spark.sql(""" > CREATE EXTERNAL TABLE orctest(s string) > PARTITIONED BY (i int) > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > LOCATION '/tmp/orctest' > """) > spark.sql("msck repair table orctest") > spark.sql("set spark.sql.hive.convertMetastoreOrc=false") > // Exiting paste mode, now interpreting. > 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> spark.read.format("orc").load("/tmp/orctest").show() > +---+---+---+ > | s| l| i| > +---+---+---+ > |abc|123|123| > +---+---+---+ > scala> spark.sql("select * from orctest").show() > 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > java.lang.IndexOutOfBoundsException: toIndex = 2 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) > at java.util.ArrayList.subList(ArrayList.java:996) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at > org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:266) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > a
[jira] [Updated] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24472: -- Affects Version/s: 1.6.3 2.0.2 2.1.2 2.2.1 > Orc RecordReaderFactory throws IndexOutOfBoundsException > > > Key: SPARK-24472 > URL: https://issues.apache.org/jira/browse/SPARK-24472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Shixiong Zhu >Priority: Major > > When the column number of the underlying file schema is greater than the > column number of the table schema, Orc RecordReaderFactory will throw > IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be > turned off to use HiveTableScanExec. Here is a reproducer: > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > Seq(("abc", 123, 123L)).toDF("s", "i", > "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest") > spark.sql(""" > CREATE EXTERNAL TABLE orctest(s string) > PARTITIONED BY (i int) > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > LOCATION '/tmp/orctest' > """) > spark.sql("msck repair table orctest") > spark.sql("set spark.sql.hive.convertMetastoreOrc=false") > // Exiting paste mode, now interpreting. > 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> spark.read.format("orc").load("/tmp/orctest").show() > +---+---+---+ > | s| l| i| > +---+---+---+ > |abc|123|123| > +---+---+---+ > scala> spark.sql("select * from orctest").show() > 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > java.lang.IndexOutOfBoundsException: toIndex = 2 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) > at java.util.ArrayList.subList(ArrayList.java:996) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at > org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:266) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD
[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache
[ https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503394#comment-16503394 ] Apache Spark commented on SPARK-22575: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21502 > Making Spark Thrift Server clean up its cache > - > > Key: SPARK-22575 > URL: https://issues.apache.org/jira/browse/SPARK-22575 > Project: Spark > Issue Type: Improvement > Components: Block Manager, SQL >Affects Versions: 2.2.0 >Reporter: Oz Ben-Ami >Priority: Minor > Labels: cache, dataproc, thrift, yarn > > Currently, Spark Thrift Server accumulates data in its appcache, even for old > queries. This fills up the disk (using over 100GB per worker node) within > days, and the only way to clear it is to restart the Thrift Server > application. Even deleting the files directly isn't a solution, as Spark then > complains about FileNotFound. > I asked about this on [Stack > Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache] > a few weeks ago, but it does not seem to be currently doable by > configuration. > Am I missing some configuration option, or some other factor here? > Otherwise, can anyone point me to the code that handles this, so maybe I can > try my hand at a fix? > Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22575) Making Spark Thrift Server clean up its cache
[ https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22575: Assignee: (was: Apache Spark) > Making Spark Thrift Server clean up its cache > - > > Key: SPARK-22575 > URL: https://issues.apache.org/jira/browse/SPARK-22575 > Project: Spark > Issue Type: Improvement > Components: Block Manager, SQL >Affects Versions: 2.2.0 >Reporter: Oz Ben-Ami >Priority: Minor > Labels: cache, dataproc, thrift, yarn > > Currently, Spark Thrift Server accumulates data in its appcache, even for old > queries. This fills up the disk (using over 100GB per worker node) within > days, and the only way to clear it is to restart the Thrift Server > application. Even deleting the files directly isn't a solution, as Spark then > complains about FileNotFound. > I asked about this on [Stack > Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache] > a few weeks ago, but it does not seem to be currently doable by > configuration. > Am I missing some configuration option, or some other factor here? > Otherwise, can anyone point me to the code that handles this, so maybe I can > try my hand at a fix? > Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22575) Making Spark Thrift Server clean up its cache
[ https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22575: Assignee: Apache Spark > Making Spark Thrift Server clean up its cache > - > > Key: SPARK-22575 > URL: https://issues.apache.org/jira/browse/SPARK-22575 > Project: Spark > Issue Type: Improvement > Components: Block Manager, SQL >Affects Versions: 2.2.0 >Reporter: Oz Ben-Ami >Assignee: Apache Spark >Priority: Minor > Labels: cache, dataproc, thrift, yarn > > Currently, Spark Thrift Server accumulates data in its appcache, even for old > queries. This fills up the disk (using over 100GB per worker node) within > days, and the only way to clear it is to restart the Thrift Server > application. Even deleting the files directly isn't a solution, as Spark then > complains about FileNotFound. > I asked about this on [Stack > Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache] > a few weeks ago, but it does not seem to be currently doable by > configuration. > Am I missing some configuration option, or some other factor here? > Otherwise, can anyone point me to the code that handles this, so maybe I can > try my hand at a fix? > Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column
[ https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23803: --- Assignee: Asher Saban > Support bucket pruning to optimize filtering on a bucketed column > - > > Key: SPARK-23803 > URL: https://issues.apache.org/jira/browse/SPARK-23803 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Asher Saban >Assignee: Asher Saban >Priority: Major > Fix For: 2.4.0 > > Original Estimate: 96h > Remaining Estimate: 96h > > support bucket pruning when filtering on a single bucketed column on the > following predicates - > # EqualTo > # EqualNullSafe > # In > # (1)-(3) combined in And/Or predicates > > based on [~smilegator]'s work in SPARK-12850 which was removed from the code > base. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column
[ https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23803. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20915 [https://github.com/apache/spark/pull/20915] > Support bucket pruning to optimize filtering on a bucketed column > - > > Key: SPARK-23803 > URL: https://issues.apache.org/jira/browse/SPARK-23803 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Asher Saban >Priority: Major > Fix For: 2.4.0 > > Original Estimate: 96h > Remaining Estimate: 96h > > support bucket pruning when filtering on a single bucketed column on the > following predicates - > # EqualTo > # EqualNullSafe > # In > # (1)-(3) combined in And/Or predicates > > based on [~smilegator]'s work in SPARK-12850 which was removed from the code > base. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
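A minimal sketch of the query shape this optimization targets, with placeholder table and column names (the predicate shapes follow the list in the ticket above):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-pruning-sketch").getOrCreate()
import spark.implicits._

// Write a table bucketed on the column we will later filter on.
val users = (1 to 100000).map(i => (i.toLong, s"name-$i")).toDF("user_id", "name")
users.write
  .bucketBy(64, "user_id")
  .sortBy("user_id")
  .mode("overwrite")
  .saveAsTable("users_bucketed")

// EqualTo on the bucketed column: only the bucket that can hold key 42 needs scanning.
spark.table("users_bucketed").filter($"user_id" === 42L).show()

// In on the bucketed column: only the buckets that keys 1, 2 and 3 hash to qualify.
spark.table("users_bucketed").filter($"user_id".isin(1L, 2L, 3L)).explain()
{code}
With the issue resolved for 2.4.0, the pruning should show up in the scan node of the physical plan printed by explain(); on earlier versions the same query scans every bucket.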
[jira] [Comment Edited] (SPARK-15882) Discuss distributed linear algebra in spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503256#comment-16503256 ] Kyle Prifogle edited comment on SPARK-15882 at 6/6/18 1:00 PM: --- [~josephkb] Noticing this is almost 2 years old now, which gives me the impression that this isn't going to be done? If I took the time to start to bite this off, does anyone think they would use it, or are most people finding their own more "expert" solutions :D ? Seems like matrix representations and basic algebra are pretty fundamental. was (Author: kprifogle1): Noticing this is almost 2 years old now, which gives me the impression that this isn't going to be done? If I took the time to start to bite this off, does anyone think they would use it, or are most people finding their own more "expert" solutions :D ? Seems like matrix representations and basic algebra are pretty fundamental. > Discuss distributed linear algebra in spark.ml package > -- > > Key: SPARK-15882 > URL: https://issues.apache.org/jira/browse/SPARK-15882 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > > This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* > should be migrated to org.apache.spark.ml. > Initial questions: > * Should we use Datasets or RDDs underneath? > * If Datasets, are there missing features needed for the migration? > * Do we want to redesign any aspects of the distributed matrices during this > move? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503256#comment-16503256 ] Kyle Prifogle commented on SPARK-15882: --- Noticing this is almost 2 years old now, which gives me the impression that this isn't going to be done? If I took the time to start to bite this off, does anyone think they would use it, or are most people finding their own more "expert" solutions :D ? Seems like matrix representations and basic algebra are pretty fundamental. > Discuss distributed linear algebra in spark.ml package > -- > > Key: SPARK-15882 > URL: https://issues.apache.org/jira/browse/SPARK-15882 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > > This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* > should be migrated to org.apache.spark.ml. > Initial questions: > * Should we use Datasets or RDDs underneath? > * If Datasets, are there missing features needed for the migration? > * Do we want to redesign any aspects of the distributed matrices during this > move? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24471) MlLib distributed plans
[ https://issues.apache.org/jira/browse/SPARK-24471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Prifogle resolved SPARK-24471. --- Resolution: Duplicate > MlLib distributed plans > --- > > Key: SPARK-24471 > URL: https://issues.apache.org/jira/browse/SPARK-24471 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Kyle Prifogle >Priority: Major > > > I have found myself using MlLib's CoordinateMatrix and RowMatrix a lot lately. > Since the new API is centered on Ml.linalg and MlLib is in maintenance mode, > are there plans to move all the matrix components over to Ml.linalg? I don't > see a distributed package in the new one yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24471) MlLib distributed plans
[ https://issues.apache.org/jira/browse/SPARK-24471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503251#comment-16503251 ] Kyle Prifogle commented on SPARK-24471: --- This is what I was looking for: https://issues.apache.org/jira/browse/SPARK-15882 Also, please remove the "Question" option from Jira if it's not to be used. > MlLib distributed plans > --- > > Key: SPARK-24471 > URL: https://issues.apache.org/jira/browse/SPARK-24471 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Kyle Prifogle >Priority: Major > > > I have found myself using MlLib's CoordinateMatrix and RowMatrix a lot lately. > Since the new API is centered on Ml.linalg and MlLib is in maintenance mode, > are there plans to move all the matrix components over to Ml.linalg? I don't > see a distributed package in the new one yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
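A short sketch of the spark.mllib distributed-matrix usage being discussed, for context; as the reporter notes, as of 2.3.0 there is no org.apache.spark.ml counterpart to the linalg.distributed package:
{code:scala}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("distributed-matrix-sketch").getOrCreate()
val sc = spark.sparkContext

// Sparse entries -> CoordinateMatrix -> RowMatrix (all in org.apache.spark.mllib).
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 2, 3.0),
  MatrixEntry(1, 1, 2.0), MatrixEntry(2, 2, 5.0)))
val coord = new CoordinateMatrix(entries)
val rowMat: RowMatrix = coord.toRowMatrix()

// Distributed linear algebra currently available only on the mllib types.
val gram  = rowMat.computeGramianMatrix()            // A^T * A as a local matrix
val stats = rowMat.computeColumnSummaryStatistics()  // per-column mean, variance, ...
println(s"rows=${rowMat.numRows()} cols=${rowMat.numCols()} " +
  s"gram=${gram.numRows}x${gram.numCols} colMeans=${stats.mean}")
{code}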
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Description: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled * The stage that hangs (referred to as "Second stage" above) has a lower 'Stage Id' than the first one that completes was: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. 
Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The stag
[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503225#comment-16503225 ] Al M commented on SPARK-24474: -- I appreciate that 2.2.0 is slightly old but I couldn't see any scheduler fixes in later versions that sounded like this. > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Description: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled was: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. 
one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
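As an aside on SPARK-24474: a minimal sketch of a job with the shape described above (not taken from the ticket; the row counts, partition counts, and the running {{spark}} session are assumptions). Two datasets each need their own shuffle before the join, so two stages run in parallel and the join stage comes third.
{code:scala}
// Hypothetical reproducer sketch for the reported scheduling behaviour.
// Assumes a spark-shell on a YARN cluster with a SparkSession named `spark`;
// the row counts and partition counts below are made up for illustration.
import org.apache.spark.sql.functions.col

val big   = spark.range(0L, 3000000000L, 1L, 30000)  // 30k partitions -> ~30k map tasks
  .withColumn("key", col("id") % 1000000)
val small = spark.range(0L, 200000000L, 1L, 2000)    // 2k partitions -> ~2k map tasks
  .withColumn("key", col("id") % 1000000)

// Each side needs its own shuffle (its own stage) before the join stage can start.
val bigAgg   = big.groupBy("key").count()
val smallAgg = small.groupBy("key").count()

// The report: while both map stages run, cores are split between them; once the
// small stage finishes, roughly half the cores sit idle until the big stage ends.
bigAgg.join(smallAgg, "key").count()
{code}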
[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503217#comment-16503217 ] Teng Peng commented on SPARK-24431: --- [~Ben2018] The article makes sense to me. It seems the current behavior follows "Case 2: TP is not 0", but it sets precision = 1 if recall = 0. (See [https://github.com/apache/spark/blob/734ed7a7b397578f16549070f350215bde369b3c/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L110] ) Have you checked out SPARK-21806 and its discussion on JIRA? Let's hear [~srowen]'s opinion. > wrong areaUnderPR calculation in BinaryClassificationEvaluator > --- > > Key: SPARK-24431 > URL: https://issues.apache.org/jira/browse/SPARK-24431 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Xinyong Tian >Priority: Major > > My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., > evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to > select the best model. When the regParam in logistic regression is very high, no > variables are selected (no model), i.e. every row's prediction is the same, e.g. > equal to the event rate (baseline frequency). But at this point, > BinaryClassificationEvaluator scores the areaUnderPR highest, so the best > model selected is a no model. > The reason is the following: with no model, the precision-recall curve has > only two points: at recall = 0, precision should be set to zero, but the > software sets it to 1; at recall = 1, precision is the event rate. As a result, > the areaUnderPR will be close to 0.5 (my event rate is very low), which is the > maximum. > The solution is to set precision = 0 when recall = 0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
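To illustrate the SPARK-24431 discussion above, a small spark-shell sketch (invented data; a constant score stands in for the "no model", and the exact numbers depend on the Spark version, see SPARK-21806). With a 5% event rate, the curve reported against 2.2.0 is (recall 0, precision 1) and (recall 1, precision 0.05), whose trapezoidal area is (1 + 0.05) / 2 = 0.525; setting precision to 0 at recall 0 would instead give 0.05 / 2 = 0.025.
{code:scala}
// Hypothetical sketch, assuming an active SparkContext `sc` in spark-shell.
// A "no model" gives every row the same score, so only one threshold exists and
// the PR curve collapses to the two points discussed above.
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val scoreAndLabels = sc.parallelize(
  Seq.fill(95)((0.5, 0.0)) ++ Seq.fill(5)((0.5, 1.0)))  // constant score, 5% positives

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.pr().collect()    // the two-point curve described in the comment above
metrics.areaUnderPR()     // ~0.525 under the precision-1-at-recall-0 convention,
                          // ~0.025 if precision were set to 0 at recall 0
{code}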
[jira] [Created] (SPARK-24474) Cores are left idle when there are a lot of stages to run
Al M created SPARK-24474: Summary: Cores are left idle when there are a lot of stages to run Key: SPARK-24474 URL: https://issues.apache.org/jira/browse/SPARK-24474 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.2.0 Reporter: Al M I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to other stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Description: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled was: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to other stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. 
one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503150#comment-16503150 ] Apache Spark commented on SPARK-21687: -- User 'debugger87' has created a pull request for this issue: https://github.com/apache/spark/pull/18900 > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Priority: Minor > > In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to > create a partition for a partitioned table. `createTime` is important > information for managing the data lifecycle, e.g. TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
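For SPARK-21687, a hedged, self-contained sketch of the call the issue says is missing (this is not the linked pull request; the timestamp source is an assumption). Hive's thrift {{Partition}} stores {{createTime}} as seconds since the epoch, so Spark would need to populate it before wrapping the object in a {{HivePartition}}.
{code:scala}
// Hypothetical sketch only: shows where setCreateTime fits on the thrift Partition.
// The timestamp source (currentTimeMillis here) is an assumption, not the actual fix.
import org.apache.hadoop.hive.metastore.api.{Partition => TPartition}

val tpart = new TPartition()
tpart.setDbName("default")
tpart.setTableName("t")
// The call that HiveClientImpl#toHivePartition currently never makes:
tpart.setCreateTime((System.currentTimeMillis() / 1000L).toInt)
{code}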
[jira] [Assigned] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21687: Assignee: (was: Apache Spark) > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Priority: Minor > > In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to > create a partition for a partitioned table. `createTime` is important > information for managing the data lifecycle, e.g. TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21687) Spark SQL should set createTime for Hive partition
[ https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21687: Assignee: Apache Spark > Spark SQL should set createTime for Hive partition > -- > > Key: SPARK-21687 > URL: https://issues.apache.org/jira/browse/SPARK-21687 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Chaozhong Yang >Assignee: Apache Spark >Priority: Minor > > In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to > create a partition for a partitioned table. `createTime` is important > information for managing the data lifecycle, e.g. TTL. > However, we found that Spark SQL doesn't call setCreateTime in > `HiveClientImpl#toHivePartition` as follows: > {code:scala} > def toHivePartition( > p: CatalogTablePartition, > ht: HiveTable): HivePartition = { > val tpart = new org.apache.hadoop.hive.metastore.api.Partition > val partValues = ht.getPartCols.asScala.map { hc => > p.spec.get(hc.getName).getOrElse { > throw new IllegalArgumentException( > s"Partition spec is missing a value for column '${hc.getName}': > ${p.spec}") > } > } > val storageDesc = new StorageDescriptor > val serdeInfo = new SerDeInfo > > p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation) > p.storage.inputFormat.foreach(storageDesc.setInputFormat) > p.storage.outputFormat.foreach(storageDesc.setOutputFormat) > p.storage.serde.foreach(serdeInfo.setSerializationLib) > serdeInfo.setParameters(p.storage.properties.asJava) > storageDesc.setSerdeInfo(serdeInfo) > tpart.setDbName(ht.getDbName) > tpart.setTableName(ht.getTableName) > tpart.setValues(partValues.asJava) > tpart.setSd(storageDesc) > new HivePartition(ht, tpart) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them
[ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503065#comment-16503065 ] Liang-Chi Hsieh edited comment on SPARK-24357 at 6/6/18 9:50 AM: - I think this is because this number {{1 << 65}} (36893488147419103232L) is more than Scala's long range. Scala: scala> Long.MaxValue res3: Long = 9223372036854775807 Python: >>> TEST_DATA = [Row(data=9223372036854775807L)] >>> frame = spark.createDataFrame(TEST_DATA) >>> frame.collect() [Row(data=9223372036854775807)] >>> TEST_DATA = [Row(data=9223372036854775808L)] >>> frame = spark.createDataFrame(TEST_DATA) >>> frame.collect() [Row(data=None)] was (Author: viirya): I think this is because this number {{1 << 65}} (36893488147419103232L) is more than Scala's long range. >>> TEST_DATA = [Row(data=9223372036854775807L)] >>> frame = spark.createDataFrame(TEST_DATA) >>> frame.collect() [Row(data=9223372036854775807)] >>> TEST_DATA = [Row(data=9223372036854775808L)] >>> frame = spark.createDataFrame(TEST_DATA) >>> frame.collect() [Row(data=None)] > createDataFrame in Python infers large integers as long type and then fails > silently when converting them > - > > Key: SPARK-24357 > URL: https://issues.apache.org/jira/browse/SPARK-24357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Major > > When inferring the schema type of an RDD passed to createDataFrame, PySpark > SQL will infer any integral type as a LongType, which is a 64-bit integer, > without actually checking whether the values will fit into a 64-bit slot. If > the values are larger than 64 bits, then when pickled and unpickled in Java, > Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is > called, it will ignore the BigInteger type and return Null. This results in > any large integers in the resulting DataFrame being silently converted to > None. This can create some very surprising and difficult to debug behavior, > in particular if you are not aware of this limitation. There should either be > a runtime error at some point in this conversion chain, or else _infer_type > should infer larger integers as DecimalType with appropriate precision, or as > BinaryType. The former would be less convenient, but the latter may be > problematic to implement in practice. In any case, we should stop silently > converting large integers to None. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
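A plain Scala aside on the ranges in the SPARK-24357 comment above (no Spark involved): the failing value is exactly 2^63, one past {{Long.MaxValue}}, and the 1 << 65 value from the original report is further out still; anything outside the Long range comes back from the pickler as a {{java.math.BigInteger}}, which applySchemaToPythonRDD then drops.
{code:scala}
// Quick range check in the Scala REPL (illustration only, not Spark code).
val maxLong  = Long.MaxValue       // 9223372036854775807 == 2^63 - 1: still a valid Long
val twoTo63  = BigInt(1) << 63     // 9223372036854775808: the value that came back as None
val twoTo65  = BigInt(1) << 65     // 36893488147419103232: the value from the original report

twoTo63.isValidLong                // false -> crosses into BigInteger territory
twoTo65.isValidLong                // false
{code}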
[jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them
[ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503065#comment-16503065 ] Liang-Chi Hsieh commented on SPARK-24357: - I think this is because this number {{1 << 65}} (36893488147419103232L) is more than Scala's long range. >>> TEST_DATA = [Row(data=9223372036854775807L)] >>> frame = spark.createDataFrame(TEST_DATA) >>> frame.collect() [Row(data=9223372036854775807)] >>> TEST_DATA = [Row(data=9223372036854775808L)] >>> frame = spark.createDataFrame(TEST_DATA) >>> frame.collect() [Row(data=None)] > createDataFrame in Python infers large integers as long type and then fails > silently when converting them > - > > Key: SPARK-24357 > URL: https://issues.apache.org/jira/browse/SPARK-24357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joel Croteau >Priority: Major > > When inferring the schema type of an RDD passed to createDataFrame, PySpark > SQL will infer any integral type as a LongType, which is a 64-bit integer, > without actually checking whether the values will fit into a 64-bit slot. If > the values are larger than 64 bits, then when pickled and unpickled in Java, > Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is > called, it will ignore the BigInteger type and return Null. This results in > any large integers in the resulting DataFrame being silently converted to > None. This can create some very surprising and difficult to debug behavior, > in particular if you are not aware of this limitation. There should either be > a runtime error at some point in this conversion chain, or else _infer_type > should infer larger integers as DecimalType with appropriate precision, or as > BinaryType. The former would be less convenient, but the latter may be > problematic to implement in practice. In any case, we should stop silently > converting large integers to None. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24473) There is no need to clip the predicted value by maxValue and minValue when computing the gradient in the SVDplusplus model
caijianming created SPARK-24473: --- Summary: There is no need to clip the predicted value by maxValue and minValue when computing the gradient in the SVDplusplus model Key: SPARK-24473 URL: https://issues.apache.org/jira/browse/SPARK-24473 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.3.0 Reporter: caijianming I think there is no need to clip the predicted value. Clipping changes the convex loss function into a non-convex one, which might have a bad influence on convergence. Other well-known recommender systems and the original paper also do not include this step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
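A self-contained toy illustration of the SPARK-24473 convexity argument (not GraphX code; the squared-error loss and the [1, 5] rating range are assumptions): clipping the prediction before computing the error makes the loss flat outside the range, so the gradient vanishes exactly where the model is most wrong.
{code:scala}
// Toy example: gradient of (rating - pred)^2 with and without clipping pred first.
def clip(p: Double, lo: Double, hi: Double): Double = math.max(lo, math.min(hi, p))

def gradUnclipped(rating: Double, pred: Double): Double =
  -2.0 * (rating - pred)

def gradClipped(rating: Double, pred: Double, lo: Double, hi: Double): Double =
  if (pred < lo || pred > hi) 0.0                 // derivative of clip(p) is 0 out of range
  else -2.0 * (rating - clip(pred, lo, hi))

gradUnclipped(4.0, 6.0)          // 4.0: a clear signal to pull the prediction back down
gradClipped(4.0, 6.0, 1.0, 5.0)  // 0.0: clipping removes the training signal entirely
{code}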
[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator
[ https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503002#comment-16503002 ] Liang-Chi Hsieh commented on SPARK-24467: - [~josephkb] Does that mean {{VectorAssembler}} will change from a {{Transformer}} to a {{Model}}? > VectorAssemblerEstimator > > > Key: SPARK-24467 > URL: https://issues.apache.org/jira/browse/SPARK-24467 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > In [SPARK-22346], I believe I made a wrong API decision: I recommended adding > `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since > I thought the latter option would break most workflows. However, I should > have proposed: > * Add a Param to VectorAssembler for specifying the sizes of Vectors in the > inputCols. This Param can be optional. If not given, then VectorAssembler > will behave as it does now. If given, then VectorAssembler can use that info > instead of figuring out the Vector sizes via metadata or examining Rows in > the data (though it could do consistency checks). > * Add a VectorAssemblerEstimator which gets the Vector lengths from data and > produces a VectorAssembler with the vector lengths Param specified. > This will not break existing workflows. Migrating to > VectorAssemblerEstimator will be easier than adding VectorSizeHint since it > will not require users to manually input Vector lengths. > Note: Even with this Estimator, VectorSizeHint might prove useful for other > things in the future which require vector length metadata, so we could > consider keeping it rather than deprecating it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
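For comparison with the SPARK-24467 proposal above, the workaround that exists today (real 2.3 API): {{VectorSizeHint}} makes the user type the vector sizes in by hand, which is exactly the step the proposed {{VectorAssemblerEstimator}} would learn from data. The column names below are assumptions, and the Estimator itself does not exist yet.
{code:scala}
// Today's approach with VectorSizeHint (existing 2.3 API); the proposal above would
// replace the hand-typed size with one learned from the data by an Estimator.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")        // column names are made up for the example
  .setSize(3)                         // the manual step the Estimator would remove
  .setHandleInvalid("skip")

val assembler = new VectorAssembler()
  .setInputCols(Array("userFeatures", "clicks"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
{code}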
[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502929#comment-16502929 ] yuhao yang commented on SPARK-15064: Yuhao will be OOF from May 29th to June 6th (annual leave and conference). Please expect delayed email response. Contact 669 243 8273 for anything urgent. Regards, Yuhao > Locale support in StopWordsRemover > -- > > Key: SPARK-15064 > URL: https://issues.apache.org/jira/browse/SPARK-15064 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Major > > We support case insensitive filtering (default) in StopWordsRemover. However, > case insensitive matching depends on the locale and region, which cannot be > explicitly set in StopWordsRemover. We should consider adding this support in > MLlib. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502928#comment-16502928 ] Apache Spark commented on SPARK-15064: -- User 'dongjinleekr' has created a pull request for this issue: https://github.com/apache/spark/pull/21501 > Locale support in StopWordsRemover > -- > > Key: SPARK-15064 > URL: https://issues.apache.org/jira/browse/SPARK-15064 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Major > > We support case insensitive filtering (default) in StopWordsRemover. However, > case insensitive matching depends on the locale and region, which cannot be > explicitly set in StopWordsRemover. We should consider adding this support in > MLlib. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
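A short JVM-only illustration of why locale matters for the case-insensitive matching in SPARK-15064 (plain Java API, no Spark; the pull request above proposes exposing a locale Param on {{StopWordsRemover}} so this choice becomes explicit):
{code:scala}
import java.util.Locale

// Case folding is locale sensitive on the JVM: the classic Turkish dotless-i example.
"I".toLowerCase(Locale.ROOT)               // "i"
"I".toLowerCase(new Locale("tr", "TR"))    // "ı" (dotless i)

// Lowercasing tokens and stop words under different default locales can therefore
// make an expected match fail, which is what an explicit locale Param would control.
{code}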
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502914#comment-16502914 ] Shixiong Zhu commented on SPARK-24472: -- cc [~dongjoon] > Orc RecordReaderFactory throws IndexOutOfBoundsException > > > Key: SPARK-24472 > URL: https://issues.apache.org/jira/browse/SPARK-24472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Priority: Major > > When the column number of the underlying file schema is greater than the > column number of the table schema, Orc RecordReaderFactory will throw > IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be > turned off to use HiveTableScanExec. Here is a reproducer: > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > Seq(("abc", 123, 123L)).toDF("s", "i", > "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest") > spark.sql(""" > CREATE EXTERNAL TABLE orctest(s string) > PARTITIONED BY (i int) > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > WITH SERDEPROPERTIES ( > 'serialization.format' = '1' > ) > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > LOCATION '/tmp/orctest' > """) > spark.sql("msck repair table orctest") > spark.sql("set spark.sql.hive.convertMetastoreOrc=false") > // Exiting paste mode, now interpreting. > 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> spark.read.format("orc").load("/tmp/orctest").show() > +---+---+---+ > | s| l| i| > +---+---+---+ > |abc|123|123| > +---+---+---+ > scala> spark.sql("select * from orctest").show() > 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > java.lang.IndexOutOfBoundsException: toIndex = 2 > at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) > at java.util.ArrayList.subList(ArrayList.java:996) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) > at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) > at > org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) > at > org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:266) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) >
[jira] [Closed] (SPARK-24455) fix typo in TaskSchedulerImpl's comments
[ https://issues.apache.org/jira/browse/SPARK-24455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xueyu closed SPARK-24455. - > fix typo in TaskSchedulerImpl's comments > > > Key: SPARK-24455 > URL: https://issues.apache.org/jira/browse/SPARK-24455 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: xueyu >Assignee: xueyu >Priority: Trivial > Fix For: 2.3.2, 2.4.0 > > > fix the method name in TaskSchedulerImpl.scala 's comments -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org