[jira] [Commented] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315805#comment-16315805 ] Hyukjin Kwon commented on SPARK-22980: -- [~smilegator], May I ask why this was reopened and what I misunderstood? > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
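For readers following this thread: the mismatch above comes from the calling convention, not the arithmetic. A plain udf is invoked once per row with scalar arguments, so len(x) is len('text') == 4; a pandas_udf is invoked once per batch with whole columns (pandas Series), so len(x) is the batch size, 3. A minimal pure-Python simulation of the two conventions (no PySpark required; FakeSeries is an illustrative stand-in for pandas.Series):

```python
class FakeSeries(list):
    """Minimal stand-in for pandas.Series: len() gives the batch size,
    and scalar + series broadcasts elementwise."""
    def __radd__(self, scalar):
        return FakeSeries(scalar + v for v in self)

f = lambda x, y: len(x) + y

# Plain udf: one call per row, scalar values in.
rows = [('text', 0), ('text', 1), ('text', 2)]
udf_result = [f(x, y) for x, y in rows]        # len('text') == 4 per row

# pandas_udf: one call per batch, whole columns in.
xs = FakeSeries(['text', 'text', 'text'])
ys = FakeSeries([0, 1, 2])
pandas_udf_result = list(f(xs, ys))            # len(batch) == 3, broadcast over ys
```

This reproduces the reported difference: the per-row variant yields [4, 5, 6] while the batched variant yields [3, 4, 5].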
[jira] [Updated] (SPARK-22989) Spark Streaming UI shows 0 records when a spark-streaming-kafka application restores from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoshijie updated SPARK-22989: --- Description: When a spark-streaming-kafka application restores from checkpoint, I find the Spark Streaming UI shows 0 records for each batch. !https://raw.githubusercontent.com/smdfj/picture/master/spark/batch.png! was: When a spark-streaming-kafka application restores from checkpoint, I find the Spark Streaming UI shows 0 records for each batch. !https://github.com/smdfj/picture/blob/master/spark/batch.png! > Spark Streaming UI shows 0 records when a spark-streaming-kafka application > restores from checkpoint > > > Key: SPARK-22989 > URL: https://issues.apache.org/jira/browse/SPARK-22989 > Project: Spark > Issue Type: Bug > Components: DStreams > Affects Versions: 2.2.0 > Reporter: zhaoshijie > > When a spark-streaming-kafka application restores from checkpoint, I find > the Spark Streaming UI shows 0 records for each batch. > !https://raw.githubusercontent.com/smdfj/picture/master/spark/batch.png!
[jira] [Created] (SPARK-22989) Spark Streaming UI shows 0 records when a spark-streaming-kafka application restores from checkpoint
zhaoshijie created SPARK-22989: -- Summary: Spark Streaming UI shows 0 records when a spark-streaming-kafka application restores from checkpoint Key: SPARK-22989 URL: https://issues.apache.org/jira/browse/SPARK-22989 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.2.0 Reporter: zhaoshijie When a spark-streaming-kafka application restores from checkpoint, I find the Spark Streaming UI shows 0 records for each batch. !https://github.com/smdfj/picture/blob/master/spark/batch.png!
[jira] [Updated] (SPARK-22988) Why does dataset's unpersist clear all the caches that have the same logical plan?
[ https://issues.apache.org/jira/browse/SPARK-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang Cheng updated SPARK-22988: --- Description: When I do the following:

dataset A = some dataset
A.persist
dataset B = A.doSomething
dataset C = A.doSomething
C.persist
A.unpersist

I found that C's cache is removed too, because of the following code:

def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
  val it = cachedData.iterator()
  while (it.hasNext) {
    val cd = it.next()
    if (cd.plan.find(_.sameResult(plan)).isDefined) {
      cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking)
      it.remove()
    }
  }
}

It removes all data caches that contain the same logical plan. Should it only remove the cache whose dataset called the unpersist method?

was: (same description)

> Why does dataset's unpersist clear all the caches that have the same logical plan?
> - > > Key: SPARK-22988 > URL: https://issues.apache.org/jira/browse/SPARK-22988 > Project: Spark > Issue Type: Question > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: Wang Cheng > Priority: Minor > > When I do the following: > dataset A = some dataset > A.persist > dataset B = A.doSomething > dataset C = A.doSomething > C.persist > A.unpersist > I found that C's cache is removed too, because of the following code: > def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock { > val it = cachedData.iterator() > while (it.hasNext) { > val cd = it.next() > if (cd.plan.find(_.sameResult(plan)).isDefined) { > cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking) > it.remove() > } > } > } > It removes all data caches that contain the same logical plan. Should it only > remove the cache whose dataset called the unpersist method?
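The behavior asked about follows from keying the cache on logical-plan equality (cd.plan.find(_.sameResult(plan))) rather than on dataset identity: C's cached plan contains A's plan as a subtree, so unpersisting A matches and drops C's entry as well. A rough Python sketch of that invalidation rule (illustrative names; substring containment stands in for sameResult subtree matching):

```python
# Sketch: a cache keyed by logical plan. Unpersisting drops EVERY entry
# whose plan contains the unpersisted plan, even entries persisted
# through a different Dataset object.

cached = [
    {'name': 'A', 'plan': 'scan'},            # A.persist
    {'name': 'C', 'plan': 'scan+transform'},  # C.persist (built from A)
]

def unpersist_by_plan(plan):
    # Mirrors uncacheQuery: remove every entry containing the plan.
    cached[:] = [cd for cd in cached if plan not in cd['plan']]

unpersist_by_plan('scan')  # A.unpersist
# Both entries' plans contain 'scan', so C's cache is gone too.
```

This is why unpersisting A silently evicts C in the report above.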
[jira] [Reopened] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-22980: - > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22988) Why does dataset's unpersist clear all the caches that have the same logical plan?
[ https://issues.apache.org/jira/browse/SPARK-22988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang Cheng updated SPARK-22988: --- Description: When I do the following:

dataset A = some dataset
A.persist
dataset B = A.doSomething
dataset C = A.doSomething
C.persist
A.unpersist

I found that C's cache is removed too, because of the following code:

def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
  val it = cachedData.iterator()
  while (it.hasNext) {
    val cd = it.next()
    if (cd.plan.find(_.sameResult(plan)).isDefined) {
      cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking)
      it.remove()
    }
  }
}

It removes all data caches that contain the same logical plan. Should it only remove the cache whose dataset called the unpersist method?

was: (same description)

> Why does dataset's unpersist clear all the caches that have the same logical plan?
> - > > Key: SPARK-22988 > URL: https://issues.apache.org/jira/browse/SPARK-22988 > Project: Spark > Issue Type: Question > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: Wang Cheng > Priority: Minor > > When I do the following: > dataset A = some dataset > A.persist > dataset B = A.doSomething > dataset C = A.doSomething > C.persist > A.unpersist > I found that C's cache is removed too, because of the following code: > def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock { > val it = cachedData.iterator() > while (it.hasNext) { > val cd = it.next() > if (cd.plan.find(_.sameResult(plan)).isDefined) { > cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking) > it.remove() > } > } > } > It removes all data caches that contain the same logical plan. Should it only > remove the cache whose dataset called the unpersist method?
[jira] [Created] (SPARK-22988) Why does dataset's unpersist clear all the caches that have the same logical plan?
Wang Cheng created SPARK-22988: -- Summary: Why does dataset's unpersist clear all the caches that have the same logical plan? Key: SPARK-22988 URL: https://issues.apache.org/jira/browse/SPARK-22988 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.1.1 Reporter: Wang Cheng Priority: Minor When I do the following:

dataset A = some dataset
A.persist
dataset B = A.doSomething
dataset C = A.doSomething
C.persist
A.unpersist

I found that C's cache is removed too, because of the following code:

def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
  val it = cachedData.iterator()
  while (it.hasNext) {
    val cd = it.next()
    if (cd.plan.find(_.sameResult(plan)).isDefined) {
      cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking)
      it.remove()
    }
  }
}

It removes all data caches that contain the same logical plan. Should it only remove the cache whose dataset called the unpersist method?
[jira] [Resolved] (SPARK-22979) Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
[ https://issues.apache.org/jira/browse/SPARK-22979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22979. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.3.0 > Avoid per-record type dispatch in Python data conversion > (EvaluatePython.fromJava) > -- > > Key: SPARK-22979 > URL: https://issues.apache.org/jira/browse/SPARK-22979 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Affects Versions: 2.3.0 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Fix For: 2.3.0 > > > It seems we do per-record type dispatch when converting Java objects (from > Pyrolite) to Spark's internal data format. > See > https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L89-L162 > It looks like we can create one converter per type and then reuse it.
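The resolved improvement replaces per-record type dispatch with converters built once per field type and reused for every record. A small Python sketch of the idea (make_converter and the type mapping are illustrative, not the EvaluatePython code):

```python
# Sketch: resolve the type dispatch once per schema field, then apply the
# resulting converters to every record, instead of inspecting types per record.

def make_converter(dtype):
    # One-time dispatch: return a closure specialized for this type.
    if dtype is int:
        return lambda v: v                    # pretend ints pass through
    if dtype is str:
        return lambda v: v.encode('utf-8')    # pretend strings become bytes
    raise TypeError("unsupported type: %r" % dtype)

schema = [int, str]
converters = [make_converter(t) for t in schema]  # built once, up front

rows = [(1, 'a'), (2, 'b')]
converted = [tuple(c(v) for c, v in zip(converters, row)) for row in rows]
```

The per-record loop now does no type inspection at all; the dispatch cost is paid once per column.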
[jira] [Resolved] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion
[ https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-22566. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19792 [https://github.com/apache/spark/pull/19792] > Better error message for `_merge_type` in Pandas to Spark DF conversion > --- > > Key: SPARK-22566 > URL: https://issues.apache.org/jira/browse/SPARK-22566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Guilherme Berger >Assignee: Guilherme Berger >Priority: Minor > Fix For: 2.3.0 > > > When creating a Spark DF from a Pandas DF without specifying a schema, schema > inference is used. This inference can fail when a column contains values of > two different types; this is ok. The problem is the error message does not > tell us in which column this happened. > When this happens, it is painful to debug since the error message is too > vague. > I plan on submitting a PR which fixes this, providing a better error message > for such cases, containing the column name (and possibly the problematic > values too). 
> >>> spark_session.createDataFrame(pandas_df) > File "redacted/pyspark/sql/session.py", line 541, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal > struct = self._inferSchemaFromList(data) > File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList > schema = reduce(_merge_type, map(_infer_schema, data)) > File "redacted/pyspark/sql/types.py", line 1124, in _merge_type > for f in a.fields] > File "redacted/pyspark/sql/types.py", line 1118, in _merge_type > raise TypeError("Can not merge type %s and %s" % (type(a), type(b))) > TypeError: Can not merge type <class '...'> and <class 'pyspark.sql.types.StringType'>
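The proposed fix can be sketched by threading the field name through the recursive merge so the TypeError names the offending column. A hedged pure-Python approximation (merge_type below is illustrative, not PySpark's actual _merge_type; plain values stand in for inferred types):

```python
def merge_type(a, b, name='value'):
    """Merge two inferred 'types', naming the offending field on conflict."""
    if type(a) is not type(b):
        # The improvement: include the field name in the error message.
        raise TypeError(
            "Can not merge type %s and %s in field '%s'"
            % (type(a).__name__, type(b).__name__, name))
    if isinstance(a, dict):  # struct-like value: recurse per field
        return {k: merge_type(a[k], b[k], name=k) for k in a}
    return a

# Two 'rows' whose 'label' field has conflicting types (int vs str).
try:
    merge_type({'price': 1.0, 'label': 2}, {'price': 2.0, 'label': 'x'})
    msg = ''
except TypeError as e:
    msg = str(e)
print(msg)  # the error now names the 'label' field
```

Compared with the traceback above, the failing column is named directly instead of leaving the user to guess.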
[jira] [Assigned] (SPARK-22566) Better error message for `_merge_type` in Pandas to Spark DF conversion
[ https://issues.apache.org/jira/browse/SPARK-22566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-22566: - Assignee: Guilherme Berger > Better error message for `_merge_type` in Pandas to Spark DF conversion > --- > > Key: SPARK-22566 > URL: https://issues.apache.org/jira/browse/SPARK-22566 > Project: Spark > Issue Type: Improvement > Components: PySpark > Affects Versions: 2.2.0 > Reporter: Guilherme Berger > Assignee: Guilherme Berger > Priority: Minor > > When creating a Spark DF from a Pandas DF without specifying a schema, schema > inference is used. This inference can fail when a column contains values of > two different types; this is ok. The problem is that the error message does not > tell us in which column this happened. > When this happens, it is painful to debug, since the error message is too > vague. > I plan on submitting a PR which fixes this, providing a better error message > for such cases, containing the column name (and possibly the problematic > values too). > >>> spark_session.createDataFrame(pandas_df) > File "redacted/pyspark/sql/session.py", line 541, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File "redacted/pyspark/sql/session.py", line 401, in _createFromLocal > struct = self._inferSchemaFromList(data) > File "redacted/pyspark/sql/session.py", line 333, in _inferSchemaFromList > schema = reduce(_merge_type, map(_infer_schema, data)) > File "redacted/pyspark/sql/types.py", line 1124, in _merge_type > for f in a.fields] > File "redacted/pyspark/sql/types.py", line 1118, in _merge_type > raise TypeError("Can not merge type %s and %s" % (type(a), type(b))) > TypeError: Can not merge type <class '...'> and <class 'pyspark.sql.types.StringType'>
[jira] [Commented] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
[ https://issues.apache.org/jira/browse/SPARK-22987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315657#comment-16315657 ] Apache Spark commented on SPARK-22987: -- User 'liutang123' has created a pull request for this issue: https://github.com/apache/spark/pull/20184 > UnsafeExternalSorter causes OOM when invoking the `getIterator` function. > > > Key: SPARK-22987 > URL: https://issues.apache.org/jira/browse/SPARK-22987 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.0, 2.2.1 > Reporter: Lijia Liu > > UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. > When the `getIterator` function of UnsafeExternalSorter is called, it passes an > ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each > UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more > than 1 MB. When spilling frequently, this may cause OOM. > I try to change the Queue in ChainedIterator to an Iterator.
[jira] [Assigned] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
[ https://issues.apache.org/jira/browse/SPARK-22987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22987: Assignee: (was: Apache Spark) > UnsafeExternalSorter causes OOM when invoking the `getIterator` function. > > > Key: SPARK-22987 > URL: https://issues.apache.org/jira/browse/SPARK-22987 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.0, 2.2.1 > Reporter: Lijia Liu > > UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. > When the `getIterator` function of UnsafeExternalSorter is called, it passes an > ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each > UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more > than 1 MB. When spilling frequently, this may cause OOM. > I try to change the Queue in ChainedIterator to an Iterator.
[jira] [Assigned] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
[ https://issues.apache.org/jira/browse/SPARK-22987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22987: Assignee: Apache Spark > UnsafeExternalSorter causes OOM when invoking the `getIterator` function. > > > Key: SPARK-22987 > URL: https://issues.apache.org/jira/browse/SPARK-22987 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.0, 2.2.1 > Reporter: Lijia Liu > Assignee: Apache Spark > > UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. > When the `getIterator` function of UnsafeExternalSorter is called, it passes an > ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each > UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more > than 1 MB. When spilling frequently, this may cause OOM. > I try to change the Queue in ChainedIterator to an Iterator.
[jira] [Created] (SPARK-22987) UnsafeExternalSorter causes OOM when invoking the `getIterator` function.
Lijia Liu created SPARK-22987: - Summary: UnsafeExternalSorter causes OOM when invoking the `getIterator` function. Key: SPARK-22987 URL: https://issues.apache.org/jira/browse/SPARK-22987 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1, 2.2.0 Reporter: Lijia Liu UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. When the `getIterator` function of UnsafeExternalSorter is called, it passes an ArrayList of UnsafeSorterSpillReader to the ChainedIterator constructor. But each UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more than 1 MB. When spilling frequently, this may cause OOM. I try to change the Queue in ChainedIterator to an Iterator.
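The memory issue described above is easiest to see as arithmetic: if getIterator constructs every UnsafeSorterSpillReader up front, then with N spill files roughly N x 1 MB of read buffers are live at once; constructing readers lazily from an Iterator keeps roughly one buffer live at a time. A back-of-the-envelope sketch (the spill count is illustrative):

```python
BUFFER_BYTES = 1 << 20  # each UnsafeSorterSpillReader holds a >1 MB buffer

def peak_buffer_memory(n_spills, lazy):
    """Estimate peak read-buffer memory when chaining n spill readers.

    Eager (a Queue of already-constructed readers): all n buffers coexist.
    Lazy (an Iterator constructing readers on demand): about one at a time.
    """
    concurrent = 1 if lazy else n_spills
    return concurrent * BUFFER_BYTES

# With 500 spills: ~500 MB of buffers up front vs ~1 MB when lazy.
eager = peak_buffer_memory(500, lazy=False)
lazy = peak_buffer_memory(500, lazy=True)
```

This is the motivation for the proposed change from a Queue to an Iterator in ChainedIterator.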
[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315648#comment-16315648 ] Prateek commented on SPARK-22711: - [~hyukjin.kwon] Code updated. Make sure you have downloaded nltk, networkx, punkt, averaged_perceptron_tagger, wordnet > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit > Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo-distributed installation of Spark 2.2.0 > Reporter: Prateek > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a PySpark program with the spark-submit command, this error is > thrown. > It happens for code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > Traceback (most recent call last): > File "/home/prateek/Project/textrank.py", line 299, in > summaryRDD = sentenceTokensReduceRDD.map(lambda m: > get_summary(m)).reduceByKey(lambda c,v :c+v) > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, > in reduceByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, > in combineByKey > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, > in partitionBy > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, > in _jrdd > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, > in _wrap_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, > in _prepare_for_python_RDD > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 460, in
dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 704, in dumps > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 148, in dump > File "/usr/lib/python3.5/pickle.py", line 408, in dump > self.save(obj) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 770, in save_list > self._batch_appends(obj) > File 
"/usr/lib/python3.5/pickle.py", line 794, in _batch_appends > save(x) > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 255, in save_function > File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line > 292, in save_function_tuple > File "/usr/lib/python3.5/pickle.py", line 475, in save > f(self, obj) # Call unbound method with explicit self > File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple > save(element) > File "/usr/lib/python3.5/pickle.py", line 4
[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prateek updated SPARK-22711: Attachment: Jira_Spark_minimized_code.py updated missing code > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit > Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo-distributed installation of Spark 2.2.0 > Reporter: Prateek > Attachments: Jira_Spark_minimized_code.py > > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a PySpark program with the spark-submit command, this error is > thrown. > It happens for code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > (quoted traceback same as in the comment above)
[jira] [Updated] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py
[ https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prateek updated SPARK-22711: Attachment: (was: Jira_Spark_minimized_code.py) > _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from > cloudpickle.py > --- > > Key: SPARK-22711 > URL: https://issues.apache.org/jira/browse/SPARK-22711 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit > Affects Versions: 2.2.0, 2.2.1 > Environment: Ubuntu pseudo-distributed installation of Spark 2.2.0 > Reporter: Prateek > Original Estimate: 336h > Remaining Estimate: 336h > > When I submit a PySpark program with the spark-submit command, this error is > thrown. > It happens for code like the following: > RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v) > or > RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v) > (quoted traceback same as in the comment above)
[jira] [Resolved] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22985. - Resolution: Fixed Fix Version/s: 2.3.0 > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.3.0 > > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
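The bug class described here, generated Java source built by interpolating a user-supplied string without escaping it, can be sketched in a few lines. This is not Spark's actual codegen; `fromUTC` is a made-up placeholder function name.

```python
# Sketch of the escaping bug class: splicing the timezone argument straight
# into generated source breaks compilation when the string contains quote
# or backslash characters; escaping first keeps the generated literal intact.
def gen_unescaped(tz: str) -> str:
    return f'fromUTC(input, "{tz}")'

def gen_escaped(tz: str) -> str:
    safe = tz.replace("\\", "\\\\").replace('"', '\\"')
    return f'fromUTC(input, "{safe}")'

tz = 'GMT"garbage'
print(gen_unescaped(tz))  # the embedded quote ends the string literal early
print(gen_escaped(tz))    # the quote survives as \" inside the literal
```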
[jira] [Commented] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-22986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315615#comment-16315615 ] Apache Spark commented on SPARK-22986: -- User 'ho3rexqj' has created a pull request for this issue: https://github.com/apache/spark/pull/20183 > Avoid instantiating multiple instances of broadcast variables > -- > > Key: SPARK-22986 > URL: https://issues.apache.org/jira/browse/SPARK-22986 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: ho3rexqj > > When resources happen to be constrained on an executor the first time a > broadcast variable is instantiated it is persisted to disk by the > BlockManager. Consequently, every subsequent call to > TorrentBroadcast::readBroadcastBlock from other instances of that broadcast > variable spawns another instance of the underlying value. That is, broadcast > variables are spawned once per executor *unless* memory is constrained, in > which case every instance of a broadcast variable is provided with a unique > copy of the underlying value. > The fix I propose is to explicitly cache the underlying values using weak > references (in a ReferenceMap) - note, however, that I couldn't find a clean > approach to creating the cache container here. I added that to > BroadcastManager as a package-private field for want of a better solution, > however if something more appropriate already exists in the project for that > purpose please let me know. > The above issue was terminating our team's applications erratically - > effectively, we were distributing roughly 1 GiB of data through a broadcast > variable and under certain conditions memory was constrained the first time > the broadcast variable was loaded on an executor. 
As such, the executor > attempted to spawn several additional copies of the broadcast variable (we > were using 8 worker threads on the executor) which quickly led to the task > failing as the result of an OOM exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-22986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22986: Assignee: Apache Spark > Avoid instantiating multiple instances of broadcast variables > -- > > Key: SPARK-22986 > URL: https://issues.apache.org/jira/browse/SPARK-22986 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: ho3rexqj >Assignee: Apache Spark > > When resources happen to be constrained on an executor the first time a > broadcast variable is instantiated it is persisted to disk by the > BlockManager. Consequently, every subsequent call to > TorrentBroadcast::readBroadcastBlock from other instances of that broadcast > variable spawns another instance of the underlying value. That is, broadcast > variables are spawned once per executor *unless* memory is constrained, in > which case every instance of a broadcast variable is provided with a unique > copy of the underlying value. > The fix I propose is to explicitly cache the underlying values using weak > references (in a ReferenceMap) - note, however, that I couldn't find a clean > approach to creating the cache container here. I added that to > BroadcastManager as a package-private field for want of a better solution, > however if something more appropriate already exists in the project for that > purpose please let me know. > The above issue was terminating our team's applications erratically - > effectively, we were distributing roughly 1 GiB of data through a broadcast > variable and under certain conditions memory was constrained the first time > the broadcast variable was loaded on an executor. As such, the executor > attempted to spawn several additional copies of the broadcast variable (we > were using 8 worker threads on the executor) which quickly led to the task > failing as the result of an OOM exception. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-22986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22986: Assignee: (was: Apache Spark) > Avoid instantiating multiple instances of broadcast variables > -- > > Key: SPARK-22986 > URL: https://issues.apache.org/jira/browse/SPARK-22986 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: ho3rexqj > > When resources happen to be constrained on an executor the first time a > broadcast variable is instantiated it is persisted to disk by the > BlockManager. Consequently, every subsequent call to > TorrentBroadcast::readBroadcastBlock from other instances of that broadcast > variable spawns another instance of the underlying value. That is, broadcast > variables are spawned once per executor *unless* memory is constrained, in > which case every instance of a broadcast variable is provided with a unique > copy of the underlying value. > The fix I propose is to explicitly cache the underlying values using weak > references (in a ReferenceMap) - note, however, that I couldn't find a clean > approach to creating the cache container here. I added that to > BroadcastManager as a package-private field for want of a better solution, > however if something more appropriate already exists in the project for that > purpose please let me know. > The above issue was terminating our team's applications erratically - > effectively, we were distributing roughly 1 GiB of data through a broadcast > variable and under certain conditions memory was constrained the first time > the broadcast variable was loaded on an executor. As such, the executor > attempted to spawn several additional copies of the broadcast variable (we > were using 8 worker threads on the executor) which quickly led to the task > failing as the result of an OOM exception. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22986) Avoid instantiating multiple instances of broadcast variables
ho3rexqj created SPARK-22986: Summary: Avoid instantiating multiple instances of broadcast variables Key: SPARK-22986 URL: https://issues.apache.org/jira/browse/SPARK-22986 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: ho3rexqj When resources happen to be constrained on an executor the first time a broadcast variable is instantiated it is persisted to disk by the BlockManager. Consequently, every subsequent call to TorrentBroadcast::readBroadcastBlock from other instances of that broadcast variable spawns another instance of the underlying value. That is, broadcast variables are spawned once per executor *unless* memory is constrained, in which case every instance of a broadcast variable is provided with a unique copy of the underlying value. The fix I propose is to explicitly cache the underlying values using weak references (in a ReferenceMap) - note, however, that I couldn't find a clean approach to creating the cache container here. I added that to BroadcastManager as a package-private field for want of a better solution, however if something more appropriate already exists in the project for that purpose please let me know. The above issue was terminating our team's applications erratically - effectively, we were distributing roughly 1 GiB of data through a broadcast variable and under certain conditions memory was constrained the first time the broadcast variable was loaded on an executor. As such, the executor attempted to spawn several additional copies of the broadcast variable (we were using 8 worker threads on the executor) which quickly led to the task failing as the result of an OOM exception. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
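The proposed fix, caching broadcast values behind weak references so repeated reads return one shared copy, can be sketched in Python. The real change lives in Scala's BroadcastManager using a ReferenceMap; the names below are illustrative only.

```python
import weakref

class Payload:
    """Stand-in for a large broadcast value (plain lists cannot be weakly
    referenced in CPython, so wrap the data in an object)."""
    def __init__(self, data):
        self.data = data

class BroadcastCache:
    """Illustrative value cache: while any task still holds the value,
    later reads of the same broadcast id reuse it; once all strong
    references drop, the weak map lets it be garbage collected."""
    def __init__(self):
        self._values = weakref.WeakValueDictionary()

    def get_or_load(self, broadcast_id, load):
        value = self._values.get(broadcast_id)
        if value is None:            # cache miss: materialize exactly once
            value = load()
            self._values[broadcast_id] = value
        return value

cache = BroadcastCache()
loads = []

def load():
    loads.append(1)
    return Payload(["~1 GiB of data"])

first = cache.get_or_load(0, load)
second = cache.get_or_load(0, load)
print(first is second, len(loads))  # one shared copy, loaded once
```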
[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jepson updated SPARK-22968: --- Component/s: (was: Structured Streaming) Spark Core > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.streaming.
[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315552#comment-16315552 ] Jepson commented on SPARK-22968: [~srowen] Thanks for the quick response. I turned the parameter "max.partition.fetch.bytes" up from 5242880 to 10485760. I'll look at it again in the coming days. > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 
1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.sched
[jira] [Commented] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
[ https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315547#comment-16315547 ] Hyukjin Kwon commented on SPARK-22967: -- Ah, I meant to fix the tests to use URI forms instead of Windows file paths if they are not specific to testing the path forms themselves. > VersionSuite failed on Windows caused by unescapeSQLString() > > > Key: SPARK-22967 > URL: https://issues.apache.org/jira/browse/SPARK-22967 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Windows 7 >Reporter: wuyi >Priority: Minor > Labels: build, test, windows > > On Windows systems, two unit test cases fail while running VersionSuite > ("A simple set of tests that call the methods of a `HiveClient`, loading > different version of hive from maven central.") > Failed A : test(s"$version: read avro file containing decimal") > {code:java} > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.lang.IllegalArgumentException: Can not create a > Path from an empty string); > {code} > Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table") > {code:java} > Unable to infer the schema. The schema specification is required to create > the table `default`.`tab2`.; > org.apache.spark.sql.AnalysisException: Unable to infer the schema. The > schema specification is required to create the table `default`.`tab2`.; > {code} > As I dug into this problem, I found it is related to > ParserUtils#unescapeSQLString(). 
> These are two lines at the beginning of Failed A: > {code:java} > val url = > Thread.currentThread().getContextClassLoader.getResource("avroDecimal") > val location = new File(url.getFile) > {code} > And in my environment, `location` (path value) is > {code:java} > D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal > {code} > And then, in SparkSqlParser#visitCreateHiveTable()#L1128: > {code:java} > val location = Option(ctx.locationSpec).map(visitLocationSpec) > {code} > This line wants to get the LocationSpecContext's content first, which is equal to > `location` above. > Then, the content is passed to visitLocationSpec(), and passed to > unescapeSQLString() > finally. > Let's have a look at unescapeSQLString(): > {code:java} > /** Unescape backslash-escaped string enclosed by quotes. */ > def unescapeSQLString(b: String): String = { > var enclosure: Character = null > val sb = new StringBuilder(b.length()) > def appendEscapedChar(n: Char) { > n match { > case '0' => sb.append('\u0000') > case '\'' => sb.append('\'') > case '"' => sb.append('\"') > case 'b' => sb.append('\b') > case 'n' => sb.append('\n') > case 'r' => sb.append('\r') > case 't' => sb.append('\t') > case 'Z' => sb.append('\u001A') > case '\\' => sb.append('\\') > // The following 2 lines are exactly what MySQL does TODO: why do we > do this? > case '%' => sb.append("\\%") > case '_' => sb.append("\\_") > case _ => sb.append(n) > } > } > var i = 0 > val strLength = b.length > while (i < strLength) { > val currentChar = b.charAt(i) > if (enclosure == null) { > if (currentChar == '\'' || currentChar == '\"') { > enclosure = currentChar > } > } else if (enclosure == currentChar) { > enclosure = null > } else if (currentChar == '\\') { > if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') { > // \u0000 style character literals. 
> val base = i + 2 > val code = (0 until 4).foldLeft(0) { (mid, j) => > val digit = Character.digit(b.charAt(j + base), 16) > (mid << 4) + digit > } > sb.append(code.asInstanceOf[Char]) > i += 5 > } else if (i + 4 < strLength) { > // \000 style character literals. > val i1 = b.charAt(i + 1) > val i2 = b.charAt(i + 2) > val i3 = b.charAt(i + 3) > if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= > '0' && i3 <= '7')) { > val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << > 6)).asInstanceOf[Char] > sb.append(tmp) > i += 3 > } else { > appendEscapedChar(i1) > i += 1 > } > } else if (i + 2 < strLength) { > // escaped character literals. > val n = b.charAt(i + 1) > appendEscapedChar(n) > i += 1 > } > } else { > // non-escaped character literals. > sb.append(currentChar) > } > i += 1 > } > sb.toString(
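To see why routing a raw Windows path through this routine corrupts it, a stripped-down model of the backslash handling is enough: sequences like `\t` inside `D:\workspace\target` become control characters. The function below is an approximation for illustration, not the Scala code quoted above (the `\u` and octal branches are omitted).

```python
# Simplified model of unescapeSQLString's single-character escape handling.
# Feeding it a raw Windows path shows the corruption the test suite hits.
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "b": "\b",
            "'": "'", '"': '"', "\\": "\\"}

def unescape(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            nxt = s[i + 1]
            # unknown escapes keep only the char, dropping the backslash
            out.append(_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

path = r"D:\workspace\target\test-classes\avroDecimal"
mangled = unescape(path)
print(repr(mangled))  # backslashes are gone; \t became tab characters
```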
[jira] [Assigned] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22985: Assignee: Apache Spark (was: Josh Rosen) > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22985: Assignee: Josh Rosen (was: Apache Spark) > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
[ https://issues.apache.org/jira/browse/SPARK-22985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315530#comment-16315530 ] Apache Spark commented on SPARK-22985: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20182 > Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen > -- > > Key: SPARK-22985 > URL: https://issues.apache.org/jira/browse/SPARK-22985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The from_utc_timestamp and to_utc_timestamp expressions do not properly > escape their timezone argument in codegen, leading to compilation errors > instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22985) Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen
Josh Rosen created SPARK-22985: -- Summary: Fix argument escaping bug in from_utc_timestamp / to_utc_timestamp codegen Key: SPARK-22985 URL: https://issues.apache.org/jira/browse/SPARK-22985 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.0.0, 2.3.0 Reporter: Josh Rosen Assignee: Josh Rosen The from_utc_timestamp and to_utc_timestamp expressions do not properly escape their timezone argument in codegen, leading to compilation errors instead of analysis errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-22984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22984: Assignee: Josh Rosen (was: Apache Spark) > Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner > --- > > Key: SPARK-22984 > URL: https://issues.apache.org/jira/browse/SPARK-22984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following query returns an incorrect answer: > {code} > set spark.sql.autoBroadcastJoinThreshold=-1; > create table a as select * from values 1; > create table b as select * from values 2; > SELECT > t3.col1, > t1.col1 > FROM a t1 > CROSS JOIN b t2 > CROSS JOIN b t3 > {code} > This should return the row {{2, 1}} but instead it returns {{null, 1}}. If > you permute the order of the columns in the select statement or the order of > the joins then it returns a valid answer (i.e. one without incorrect NULLs). > This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, > which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-22984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315520#comment-16315520 ] Apache Spark commented on SPARK-22984: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20181 > Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner > --- > > Key: SPARK-22984 > URL: https://issues.apache.org/jira/browse/SPARK-22984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following query returns an incorrect answer: > {code} > set spark.sql.autoBroadcastJoinThreshold=-1; > create table a as select * from values 1; > create table b as select * from values 2; > SELECT > t3.col1, > t1.col1 > FROM a t1 > CROSS JOIN b t2 > CROSS JOIN b t3 > {code} > This should return the row {{2, 1}} but instead it returns {{null, 1}}. If > you permute the order of the columns in the select statement or the order of > the joins then it returns a valid answer (i.e. one without incorrect NULLs). > This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, > which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-22984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22984: Assignee: Apache Spark (was: Josh Rosen) > Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner > --- > > Key: SPARK-22984 > URL: https://issues.apache.org/jira/browse/SPARK-22984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > The following query returns an incorrect answer: > {code} > set spark.sql.autoBroadcastJoinThreshold=-1; > create table a as select * from values 1; > create table b as select * from values 2; > SELECT > t3.col1, > t1.col1 > FROM a t1 > CROSS JOIN b t2 > CROSS JOIN b t3 > {code} > This should return the row {{2, 1}} but instead it returns {{null, 1}}. If > you permute the order of the columns in the select statement or the order of > the joins then it returns a valid answer (i.e. one without incorrect NULLs). > This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, > which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22984) Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner
Josh Rosen created SPARK-22984: -- Summary: Fix incorrect bitmap copying and offset shifting in GenerateUnsafeRowJoiner Key: SPARK-22984 URL: https://issues.apache.org/jira/browse/SPARK-22984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0, 1.5.0, 2.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical The following query returns an incorrect answer: {code} set spark.sql.autoBroadcastJoinThreshold=-1; create table a as select * from values 1; create table b as select * from values 2; SELECT t3.col1, t1.col1 FROM a t1 CROSS JOIN b t2 CROSS JOIN b t3 {code} This should return the row {{2, 1}} but instead it returns {{null, 1}}. If you permute the order of the columns in the select statement or the order of the joins then it returns a valid answer (i.e. one without incorrect NULLs). This turns out to be due to two longstanding bugs in GenerateUnsafeRowJoiner, which I'll describe in more detail in my PR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
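The bug report above concerns null-bitmap handling when two UnsafeRows are concatenated. As an illustrative sketch (not Spark's actual Java codegen), the core invariant is that the second row's null bitmap must be shifted by the first row's field count before being OR-ed in; skipping the shift marks the wrong fields as null, which is exactly the kind of error that turns a valid value like `2` into an incorrect NULL:

```python
# Illustrative sketch (not Spark's GenerateUnsafeRowJoiner code): combining
# the per-field null bitmaps of two rows being joined. Bit i set means field i
# of the joined row is null.

def join_null_bitmaps(bitmap1: int, num_fields1: int, bitmap2: int) -> int:
    # Row 2's bits must be shifted past row 1's fields before OR-ing.
    return bitmap1 | (bitmap2 << num_fields1)

# row1 = (1,) with no nulls; row2 = (None, 2), so row2's bit 0 is set.
joined = join_null_bitmaps(0b0, 1, 0b01)
# Correct: in the 3-field joined row, only field 1 (row2's first field) is null.
assert joined == 0b010

# A buggy joiner that forgets the shift corrupts the result:
buggy = 0b0 | 0b01  # marks field 0 (row1's valid value!) as null instead
assert buggy == 0b001
```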
[jira] [Assigned] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-22983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22983: Assignee: Apache Spark (was: Josh Rosen) > Don't push filters beneath aggregates with empty grouping expressions > - > > Key: SPARK-22983 > URL: https://issues.apache.org/jira/browse/SPARK-22983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > The following SQL query should return zero rows, but in Spark it actually > returns one row: > {code} > SELECT 1 from ( > SELECT 1 AS z, > MIN(a.x) > FROM (select 1 as x) a > WHERE false > ) b > where b.z != b.z > {code} > The problem stems from the `PushDownPredicate` rule: when this rule > encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, > it removes the original filter and adds a new filter onto Aggregate's child, > e.g. `Agg(Filter(...))`. This is often okay, but the case above is a > counterexample: because there is no explicit `GROUP BY`, we are implicitly > computing a global aggregate over the entire table so the original filter was > not acting like a `HAVING` clause filtering the number of groups: if we push > this filter then it fails to actually reduce the cardinality of the Aggregate > output, leading to the wrong answer. > A simple fix is to never push down filters beneath aggregates when there are > no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-22983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22983: Assignee: Josh Rosen (was: Apache Spark) > Don't push filters beneath aggregates with empty grouping expressions > - > > Key: SPARK-22983 > URL: https://issues.apache.org/jira/browse/SPARK-22983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following SQL query should return zero rows, but in Spark it actually > returns one row: > {code} > SELECT 1 from ( > SELECT 1 AS z, > MIN(a.x) > FROM (select 1 as x) a > WHERE false > ) b > where b.z != b.z > {code} > The problem stems from the `PushDownPredicate` rule: when this rule > encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, > it removes the original filter and adds a new filter onto Aggregate's child, > e.g. `Agg(Filter(...))`. This is often okay, but the case above is a > counterexample: because there is no explicit `GROUP BY`, we are implicitly > computing a global aggregate over the entire table so the original filter was > not acting like a `HAVING` clause filtering the number of groups: if we push > this filter then it fails to actually reduce the cardinality of the Aggregate > output, leading to the wrong answer. > A simple fix is to never push down filters beneath aggregates when there are > no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-22983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315517#comment-16315517 ] Apache Spark commented on SPARK-22983: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20180 > Don't push filters beneath aggregates with empty grouping expressions > - > > Key: SPARK-22983 > URL: https://issues.apache.org/jira/browse/SPARK-22983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Labels: correctness > > The following SQL query should return zero rows, but in Spark it actually > returns one row: > {code} > SELECT 1 from ( > SELECT 1 AS z, > MIN(a.x) > FROM (select 1 as x) a > WHERE false > ) b > where b.z != b.z > {code} > The problem stems from the `PushDownPredicate` rule: when this rule > encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, > it removes the original filter and adds a new filter onto Aggregate's child, > e.g. `Agg(Filter(...))`. This is often okay, but the case above is a > counterexample: because there is no explicit `GROUP BY`, we are implicitly > computing a global aggregate over the entire table so the original filter was > not acting like a `HAVING` clause filtering the number of groups: if we push > this filter then it fails to actually reduce the cardinality of the Aggregate > output, leading to the wrong answer. > A simple fix is to never push down filters beneath aggregates when there are > no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22983) Don't push filters beneath aggregates with empty grouping expressions
Josh Rosen created SPARK-22983: -- Summary: Don't push filters beneath aggregates with empty grouping expressions Key: SPARK-22983 URL: https://issues.apache.org/jira/browse/SPARK-22983 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical The following SQL query should return zero rows, but in Spark it actually returns one row: {code} SELECT 1 from ( SELECT 1 AS z, MIN(a.x) FROM (select 1 as x) a WHERE false ) b where b.z != b.z {code} The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is often okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer. A simple fix is to never push down filters beneath aggregates when there are no grouping expressions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
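The cardinality argument in the description can be sketched in plain Python (this is an illustrative model, not Spark's optimizer code): a global aggregate with no GROUP BY always emits exactly one row, even over zero input rows, so a filter placed above it behaves differently from the same filter pushed below it:

```python
# Illustrative sketch (not Spark's PushDownPredicate rule): a global aggregate
# emits one output row regardless of input cardinality, so pushing a filter
# beneath it cannot reduce the one-row output and changes the query's answer.

def global_min(rows):
    # Global aggregate (no grouping): exactly one output row, always.
    values = [r["x"] for r in rows]
    return [{"z": 1, "min_x": min(values) if values else None}]

rows = [{"x": 1}]
pred = lambda out: out["z"] != out["z"]  # always false, like b.z != b.z

# Filter on top of the aggregate: correctly returns zero rows.
correct = [out for out in global_min(rows) if pred(out)]
assert correct == []

# Filter pushed beneath the aggregate: the input is emptied, but the
# aggregate still emits its single global row, so one row survives.
pushed = global_min([r for r in rows if False])
assert len(pushed) == 1
```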
[jira] [Assigned] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22982: Assignee: Apache Spark (was: Josh Rosen) > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22982: Assignee: Josh Rosen (was: Apache Spark) > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315514#comment-16315514 ] Apache Spark commented on SPARK-22982: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/20179 > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
Josh Rosen created SPARK-22982: -- Summary: Remove unsafe asynchronous close() call from FileDownloadChannel Key: SPARK-22982 URL: https://issues.apache.org/jira/browse/SPARK-22982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Spark's Netty-based file transfer code contains an asynchronous IO bug which may lead to incorrect query results. At a high level, the problem is that an unsafe asynchronous `close()` of a pipe's source channel creates a race condition where file transfer code closes a file descriptor and then attempts to read from it. If the closed file descriptor's number has been reused by an `open()` call, then this invalid read may cause unrelated file operations to return incorrect results due to reading different data than intended. I have a small, surgical fix for this bug and will submit a PR with more description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
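The descriptor-reuse hazard described above can be demonstrated outside Spark. This is a sketch of the underlying POSIX behavior, not of Spark's Netty code: `open()` returns the lowest-numbered free descriptor, so a number freed by one close is typically handed out again immediately, and a stale read on the old number silently reads the wrong file:

```python
# Illustrative sketch of the POSIX file-descriptor-reuse hazard (not Spark's
# FileDownloadChannel code). On POSIX systems open() returns the lowest free
# descriptor number, so a closed descriptor's number is usually reused by the
# very next open(); a reader still holding the old number then reads the
# wrong file's bytes.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path_a = os.path.join(d, "a")
    path_b = os.path.join(d, "b")
    with open(path_a, "wb") as f:
        f.write(b"AAAA")
    with open(path_b, "wb") as f:
        f.write(b"BBBB")

    fd = os.open(path_a, os.O_RDONLY)
    os.close(fd)                       # the "asynchronous" close
    fd2 = os.open(path_b, os.O_RDONLY) # unrelated open() reuses the number
    assert fd2 == fd
    # A stale read on the old number now returns B's data, not A's.
    assert os.read(fd, 4) == b"BBBB"
    os.close(fd2)
```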
[jira] [Resolved] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22980. -- Resolution: Not A Problem Please reopen this if I misunderstood. I am acting quickly because it's set to a blocker. > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
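A minimal sketch of the likely reason the two UDFs differ (consistent with the "Not A Problem" resolution): a pandas_udf is vectorized, so for `lit('text')` it receives a whole column batch (a pandas Series containing `'text'` once per row) and `len(x)` is the batch's row count, whereas a plain udf receives one Python string per row and `len(x)` is the string length, 4. A plain list stands in for the pandas Series here to keep the sketch dependency-free:

```python
# Vectorized pandas_udf: x is a batch of values, so len(x) counts rows.
batch = ["text"] * 3   # stand-in for the pandas Series the pandas_udf sees
assert len(batch) == 3

# Row-at-a-time udf: x is a single Python string, so len(x) is 4.
value = "text"
assert len(value) == 4
```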
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315438#comment-16315438 ] Felix Cheung commented on SPARK-22918: -- [~sameerag] we might want to check this for 2.3.0 release > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at 
java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > 
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java
[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315436#comment-16315436 ] Felix Cheung commented on SPARK-22632: -- yes, first I'd agree we should generalize this to R & Python second, I think in general the different treatment of timezone between language and Spark has been a source of confusion (has been reported at least a few times) lastly, this isn't a regression AFAIK, so not necessarily a blocker for 2.3, although might be very good to have. > Fix the behavior of timestamp values for R's DataFrame to respect session > timezone > -- > > Key: SPARK-22632 > URL: https://issues.apache.org/jira/browse/SPARK-22632 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > Note: wording is borrowed from SPARK-22395. Symptom is similar and I think > that JIRA is well descriptive. > When converting R's DataFrame from/to Spark DataFrame using > {{createDataFrame}} or {{collect}}, timestamp values behave to respect R > system timezone instead of session timezone. > For example, let's say we use "America/Los_Angeles" as session timezone and > have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in > South Korea so R timezone would be "KST". > The timestamp value from current collect() will be the following: > {code} > > sparkR.session(master = "local[*]", sparkConfig = > > list(spark.sql.session.timeZone = "America/Los_Angeles")) > > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts")) >ts > 1 1970-01-01 00:00:01 > > collect(sql("SELECT cast(28801 as timestamp) as ts")) >ts > 1 1970-01-01 17:00:01 > {code} > As you can see, the value becomes "1970-01-01 17:00:01" because it respects R > system timezone. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
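The arithmetic behind the reported discrepancy can be checked with the standard library (a sketch assuming standard tz-database rules, with `Asia/Seoul` standing in for the reporter's KST system zone): 28801 seconds after the epoch is 1970-01-01 08:00:01 UTC; rendered in the session timezone America/Los_Angeles (UTC-8 in January) that is 00:00:01, but rendered in KST (UTC+9) it is 17:00:01, which is exactly what `collect()` shows:

```python
# Sketch of the timezone arithmetic in the report. Assumes a tz database is
# available (zoneinfo, Python 3.9+); Asia/Seoul stands in for the reporter's
# KST system timezone.
from datetime import datetime
from zoneinfo import ZoneInfo

la = datetime.fromtimestamp(28801, tz=ZoneInfo("America/Los_Angeles"))
assert la.strftime("%Y-%m-%d %H:%M:%S") == "1970-01-01 00:00:01"

kst = datetime.fromtimestamp(28801, tz=ZoneInfo("Asia/Seoul"))
assert kst.strftime("%Y-%m-%d %H:%M:%S") == "1970-01-01 17:00:01"
```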
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315434#comment-16315434 ] Felix Cheung commented on SPARK-21727: -- I think we should use is.atomic(object) ? > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil Alexander McQuarrie >Assignee: Neil Alexander McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... 
> > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber
[ https://issues.apache.org/jira/browse/SPARK-22952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22952: Assignee: Apache Spark > Deprecate stageAttemptId in favour of stageAttemptNumber > > > Key: SPARK-22952 > URL: https://issues.apache.org/jira/browse/SPARK-22952 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Minor > > As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for > SPARK-22897, we prefer stageAttemptNumber over stageAttemptId. > This is the followup to deprecate stageAttemptId which will make public APIs > more consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber
[ https://issues.apache.org/jira/browse/SPARK-22952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22952: Assignee: (was: Apache Spark) > Deprecate stageAttemptId in favour of stageAttemptNumber > > > Key: SPARK-22952 > URL: https://issues.apache.org/jira/browse/SPARK-22952 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Priority: Minor > > As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for > SPARK-22897, we prefer stageAttemptNumber over stageAttemptId. > This is the followup to deprecate stageAttemptId which will make public APIs > more consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22952) Deprecate stageAttemptId in favour of stageAttemptNumber
[ https://issues.apache.org/jira/browse/SPARK-22952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315356#comment-16315356 ] Apache Spark commented on SPARK-22952: -- User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/20178 > Deprecate stageAttemptId in favour of stageAttemptNumber > > > Key: SPARK-22952 > URL: https://issues.apache.org/jira/browse/SPARK-22952 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Priority: Minor > > As discussed in [PR-20082|https://github.com/apache/spark/pull/20082] for > SPARK-22897, we prefer stageAttemptNumber over stageAttemptId. > This is the followup to deprecate stageAttemptId which will make public APIs > more consistent. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315336#comment-16315336 ] Suchith J N commented on SPARK-22954: - I have opened a pull request. > ANALYZE TABLE fails with NoSuchTableException for temporary tables (but > should have reported "not supported on views") > -- > > Key: SPARK-22954 > URL: https://issues.apache.org/jira/browse/SPARK-22954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: {code} > $ ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152 > Branch master > Compiled by user jacek on 2018-01-04T05:44:05Z > Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8 > {code} >Reporter: Jacek Laskowski >Priority: Minor > > {{ANALYZE TABLE}} fails with {{NoSuchTableException: Table or view 'names' > not found in database 'default';}} for temporary tables (views) while the > reason is that it can only work with permanent tables (which [it can > report|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala#L38] > if it had a chance). 
> {code} > scala> names.createOrReplaceTempView("names") > scala> sql("ANALYZE TABLE names COMPUTE STATISTICS") > org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view > 'names' not found in database 'default'; > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:181) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:398) > at > org.apache.spark.sql.execution.command.AnalyzeTableCommand.run(AnalyzeTableCommand.scala:36) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:187) > at org.apache.spark.sql.Dataset$$anonfun$51.apply(Dataset.scala:3244) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3243) > at org.apache.spark.sql.Dataset.(Dataset.scala:187) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:72) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638) > ... 50 elided > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315334#comment-16315334 ] Apache Spark commented on SPARK-22954: -- User 'suchithjn225' has created a pull request for this issue: https://github.com/apache/spark/pull/20177
[jira] [Assigned] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22954: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-22954) ANALYZE TABLE fails with NoSuchTableException for temporary tables (but should have reported "not supported on views")
[ https://issues.apache.org/jira/browse/SPARK-22954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22954: Assignee: Apache Spark
[jira] [Commented] (SPARK-22980) Wrong answer when using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-22980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315217#comment-16315217 ] Hyukjin Kwon commented on SPARK-22980: -- I think that's because a pandas UDF receives each input as a pandas Series rather than a scalar. > Wrong answer when using pandas_udf > -- > > Key: SPARK-22980 > URL: https://issues.apache.org/jira/browse/SPARK-22980 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Blocker > > {noformat} > from pyspark.sql.functions import pandas_udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = pandas_udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > {noformat} > from pyspark.sql.functions import udf > from pyspark.sql.functions import col, lit > from pyspark.sql.types import LongType > df = spark.range(3) > f = udf(lambda x, y: len(x) + y, LongType()) > df.select(f(lit('text'), col('id'))).show() > {noformat} > The results of pandas_udf are different from udf.
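The discrepancy in the two snippets above can be illustrated outside Spark (this is a sketch of the calling convention, not Spark's internals): a plain Python UDF is invoked once per row, so `x` is the scalar string `'text'` and `len(x)` is 4; a pandas UDF is invoked once per batch, so `x` is a `pandas.Series` holding `'text'` repeated for every row, and `len(x)` is the batch size.

```python
import pandas as pd

batch_size = 3                                # number of rows in the batch

x_scalar = "text"                             # what udf sees, per row
x_series = pd.Series(["text"] * batch_size)   # what pandas_udf sees, per batch
y = pd.Series(range(batch_size))              # the 'id' column

# Plain UDF: len() measures the string, applied row by row.
plain_udf_result = [len(x_scalar) + i for i in range(batch_size)]

# pandas UDF: len() measures the Series, i.e. the batch size.
pandas_udf_result = (len(x_series) + y).tolist()

print(plain_udf_result)   # [4, 5, 6]
print(pandas_udf_result)  # [3, 4, 5] -- len() saw the Series, not the string
```

So the pandas_udf variant is not computing a wrong string length; `len(x)` is simply being applied to a different type of object.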