[jira] [Commented] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
[ https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435019#comment-16435019 ] Takeshi Yamamuro commented on SPARK-12105: -- I feel it is enough to use Console?
{code}
val outputStream = new java.io.ByteArrayOutputStream()
val out = new java.io.PrintStream(outputStream, true)
Console.withOut(out) {
  Seq(1, 2, 3, 4).toDS.show()
}

scala> outputStream.toString
res1: String =
"+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
+-----+
"
{code}
> Add a DataFrame.show() with argument for output PrintStream > --- > > Key: SPARK-12105 > URL: https://issues.apache.org/jira/browse/SPARK-12105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Dean Wampler >Priority: Minor > > It would be nice to send the output of DataFrame.show(...) to a different > output stream than stdout, including just capturing the string itself. This > is useful, e.g., for testing. Actually, it would be sufficient and perhaps > better to just make DataFrame.showString a public method, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
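The same capture works from PySpark without any API change, since `DataFrame.show()` prints on the Python side; a minimal sketch using the standard library's `contextlib.redirect_stdout` (a plain `print` stands in for a real DataFrame here, so the snippet runs without Spark):

```python
import io
from contextlib import redirect_stdout

def capture_stdout(fn):
    """Run fn() and return everything it printed, like Console.withOut in Scala."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        fn()
    return buf.getvalue()

# Stand-in for df.show(); a real PySpark DataFrame prints its table the same way.
captured = capture_stdout(lambda: print("+-----+\n|value|\n+-----+\n|    1|\n+-----+"))
print("value" in captured)  # -> True
```

This captures anything the wrapped call writes to stdout, which covers the "just capture the string for testing" use case from the issue.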
[jira] [Commented] (SPARK-23912) High-order function: array_distinct(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434926#comment-16434926 ] Apache Spark commented on SPARK-23912: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21050 > High-order function: array_distinct(x) → array > -- > > Key: SPARK-23912 > URL: https://issues.apache.org/jira/browse/SPARK-23912 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Remove duplicate values from the array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
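For reference, the requested semantics fit in a few lines of plain Python; this mirrors Presto's documented behavior (duplicates removed, first occurrence kept), and is only a sketch, not the Spark implementation in the PR above:

```python
def array_distinct(xs):
    """Remove duplicate values from the array, keeping each value's first occurrence."""
    seen = set()
    result = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result

print(array_distinct([1, 2, 2, 3, 1]))  # -> [1, 2, 3]
```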
[jira] [Assigned] (SPARK-23912) High-order function: array_distinct(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23912: Assignee: (was: Apache Spark) > High-order function: array_distinct(x) → array > -- > > Key: SPARK-23912 > URL: https://issues.apache.org/jira/browse/SPARK-23912 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Remove duplicate values from the array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23912) High-order function: array_distinct(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23912: Assignee: Apache Spark > High-order function: array_distinct(x) → array > -- > > Key: SPARK-23912 > URL: https://issues.apache.org/jira/browse/SPARK-23912 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Remove duplicate values from the array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23957) Sorts in subqueries are redundant and can be removed
[ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23957: Assignee: Apache Spark > Sorts in subqueries are redundant and can be removed > > > Key: SPARK-23957 > URL: https://issues.apache.org/jira/browse/SPARK-23957 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Assignee: Apache Spark >Priority: Major > > Unless combined with a {{LIMIT}}, there's no correctness reason that planned > and optimized subqueries should have any sort operators (since the result of > the subquery is an unordered collection of tuples). > For example: > {{SELECT count(1) FROM (select id FROM dft ORDER by id)}} > has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(2) Project
>          +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
>             +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>                +- *(1) Project [id#0L]
>                   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered > subquery: > {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}} > has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>    +- *(1) Project [id#0L]
>       +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is > unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a > subquery. SPARK-23375 is related, but removes sorts that are functionally > redundant because they perform the same ordering. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
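The proposed rule amounts to a bottom-up plan rewrite; a toy sketch in Python, with tiny tuple-based plan trees standing in for Catalyst's `LogicalPlan`/`Rule` machinery (which works quite differently):

```python
# Toy plan tree: a leaf is a string ("Scan"); an inner node is (operator, child).
# Illustrative sketch of the proposed rule only; Spark's optimizer rules are
# written against LogicalPlan, not tuples.
def remove_redundant_sorts(plan, parent=None):
    if isinstance(plan, str):
        return plan
    op, child = plan
    if op == "Sort" and parent is not None and parent != "Limit":
        # A Sort below any operator other than a Limit feeds an unordered
        # bag of tuples, so it cannot affect the query's result: drop it.
        return remove_redundant_sorts(child, parent)
    return (op, remove_redundant_sorts(child, op))

# SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
print(remove_redundant_sorts(("Aggregate", ("Sort", "Scan"))))  # -> ('Aggregate', 'Scan')
# A Sort under a Limit (ORDER BY ... LIMIT n) must be kept.
print(remove_redundant_sorts(("Limit", ("Sort", "Scan"))))      # -> ('Limit', ('Sort', 'Scan'))
```

A top-level Sort (parent of `None`) is kept here, since the outermost query's output order is observable by the user.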
[jira] [Commented] (SPARK-23957) Sorts in subqueries are redundant and can be removed
[ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434918#comment-16434918 ] Apache Spark commented on SPARK-23957: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/21049 > Sorts in subqueries are redundant and can be removed > > > Key: SPARK-23957 > URL: https://issues.apache.org/jira/browse/SPARK-23957 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Major > > Unless combined with a {{LIMIT}}, there's no correctness reason that planned > and optimized subqueries should have any sort operators (since the result of > the subquery is an unordered collection of tuples). > For example: > {{SELECT count(1) FROM (select id FROM dft ORDER by id)}} > has the following plan: > {code:java} > == Physical Plan == > *(3) HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition >+- *(2) HashAggregate(keys=[], functions=[partial_count(1)]) > +- *(2) Project > +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200) >+- *(1) Project [id#0L] > +- *(1) FileScan parquet [id#0L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > ... but the sort operator is redundant. 
> Less intuitively, the sort is also redundant in selections from an ordered > subquery: > {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}} > has plan: > {code:java} > == Physical Plan == > *(2) Sort [id#0L ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200) >+- *(1) Project [id#0L] > +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > {code} > ... but again, since the subquery returns a bag of tuples, the sort is > unnecessary. > We should consider adding an optimizer rule that removes a sort inside a > subquery. SPARK-23375 is related, but removes sorts that are functionally > redundant because they perform the same ordering. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23957) Sorts in subqueries are redundant and can be removed
[ https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23957: Assignee: (was: Apache Spark) > Sorts in subqueries are redundant and can be removed > > > Key: SPARK-23957 > URL: https://issues.apache.org/jira/browse/SPARK-23957 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Major > > Unless combined with a {{LIMIT}}, there's no correctness reason that planned > and optimized subqueries should have any sort operators (since the result of > the subquery is an unordered collection of tuples). > For example: > {{SELECT count(1) FROM (select id FROM dft ORDER by id)}} > has the following plan: > {code:java} > == Physical Plan == > *(3) HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition >+- *(2) HashAggregate(keys=[], functions=[partial_count(1)]) > +- *(2) Project > +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200) >+- *(1) Project [id#0L] > +- *(1) FileScan parquet [id#0L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > ... but the sort operator is redundant. > Less intuitively, the sort is also redundant in selections from an ordered > subquery: > {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}} > has plan: > {code:java} > == Physical Plan == > *(2) Sort [id#0L ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200) >+- *(1) Project [id#0L] > +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > {code} > ... but again, since the subquery returns a bag of tuples, the sort is > unnecessary. 
> We should consider adding an optimizer rule that removes a sort inside a > subquery. SPARK-23375 is related, but removes sorts that are functionally > redundant because they perform the same ordering. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23958) HadoopRdd filters empty files to avoid generating empty tasks that affect Spark computing performance
[ https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23958. -- Resolution: Duplicate > HadoopRdd filters empty files to avoid generating empty tasks that affect > Spark computing performance > - > > Key: SPARK-23958 > URL: https://issues.apache.org/jira/browse/SPARK-23958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > > HadoopRdd filters empty files to avoid generating empty tasks that affect > Spark computing performance. > An empty file's length is zero. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23950) Coalescing an empty dataframe to 1 partition
[ https://issues.apache.org/jira/browse/SPARK-23950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23950. -- Resolution: Cannot Reproduce > Coalescing an empty dataframe to 1 partition > > > Key: SPARK-23950 > URL: https://issues.apache.org/jira/browse/SPARK-23950 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 > Environment: Operating System: Windows 7 > Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3. > Hardware specs not relevant to the issue. >Reporter: João Neves >Priority: Major > > Coalescing an empty dataframe to 1 partition raises an error. > The funny thing is that coalescing an empty dataframe to 2 or more partitions > seems to work. > The test case is the following:
> {code}
> from pyspark.sql.types import StructType
> df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))
> print(df.coalesce(2).count())
> print(df.coalesce(3).count())
> print(df.coalesce(4).count())
> df.coalesce(1).count()
> {code}
> Output:
> {code:java}
> 0
> 0
> 0
> ---
> Py4JJavaError Traceback (most recent call last)
> in ()
> 7 print(df.coalesce(4).count())
> 8
> ----> 9 print(df.coalesce(1).count())
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self)
> 425 2
> 426 """
> --> 427 return int(self._jdf.count())
> 428
> 429 @ignore_unicode_prefix
> c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
> 1131 answer = self.gateway_client.send_command(command)
> 1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
> 1134
> 1135 for temp_arg in temp_args:
> C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
> 61 def deco(*a, **kw):
> 62 try:
> ---> 63 return f(*a, **kw)
> 64 except py4j.protocol.Py4JJavaError as e:
> 65 s = e.java_exception.toString()
> c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o176.count.
> : java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
> at scala.collection.IterableLike$class.head(IterableLike.scala:107)
> at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
> at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
> at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435)
> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434)
> at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2434)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Unknown Source)
> {code}
> Shouldn't this be consistent?
> Thank you very much. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
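The trace above bottoms out in `Dataset.count` taking the `head` of a collected array, i.e. the collected result was empty. A pure-Python sketch of that failure shape (a deliberate simplification for illustration, not Spark's actual code):

```python
# count() collects one partial count per partition and takes the head of the
# result. If the collected array is empty, there is no head to take -- the
# analogue of the "next on empty iterator" error in the trace.
def count_from_partials(partial_counts):
    return partial_counts[0]  # mirrors ArrayOps.head in the stack trace

def count_defensively(partial_counts):
    # one possible fix: treat an empty collected result as zero rows
    return partial_counts[0] if partial_counts else 0

try:
    count_from_partials([])    # empty collected result
except IndexError:
    print("fails, like java.util.NoSuchElementException")

print(count_defensively([]))   # -> 0
print(count_defensively([42])) # -> 42
```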
[jira] [Resolved] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent
[ https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23965. -- Resolution: Won't Fix > make python/py4j-src-0.x.y.zip file name Spark version-independent > -- > > Key: SPARK-23965 > URL: https://issues.apache.org/jira/browse/SPARK-23965 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > After each Spark release (that's normally packaged with slightly newer > version of py4j), we have to adjust our PySpark applications PYTHONPATH to > point to correct version of python/py4j-src-0.9.2.zip. > Change to python/py4j-src-0.9.2.zip to python/py4j-src-0.9.6.zip, next > release to something else etc. > Possible solutions. Would be great to either > - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or > `python/py4j-src-current.zip` > - or make a symlink in Spark distributed `py4j-src-current.zip` to whatever > version Spark is shipped with. > In either case, if this would be solved, we wouldn't have to adjust > PYTHONPATH during upgrades like Spark 2.2 to 2.3.. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23955) typo in parameter name 'rawPredicition'
[ https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23955. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/21030 > typo in parameter name 'rawPredicition' > --- > > Key: SPARK-23955 > URL: https://issues.apache.org/jira/browse/SPARK-23955 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: John Bauer >Priority: Trivial > > classifier.py MultilayerPerceptronClassifier.__init__ API call had typo > rawPredicition instead of rawPrediction > also present in doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent
[ https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434810#comment-16434810 ] Hyukjin Kwon commented on SPARK-23965: -- I would leave this resolved. I don't think it's a strong reason to rename or make a link, IMHO. > make python/py4j-src-0.x.y.zip file name Spark version-independent > -- > > Key: SPARK-23965 > URL: https://issues.apache.org/jira/browse/SPARK-23965 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > After each Spark release (that's normally packaged with slightly newer > version of py4j), we have to adjust our PySpark applications PYTHONPATH to > point to correct version of python/py4j-src-0.9.2.zip. > Change to python/py4j-src-0.9.2.zip to python/py4j-src-0.9.6.zip, next > release to something else etc. > Possible solutions. Would be great to either > - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or > `python/py4j-src-current.zip` > - or make a symlink in Spark distributed `py4j-src-current.zip` to whatever > version Spark is shipped with. > In either case, if this would be solved, we wouldn't have to adjust > PYTHONPATH during upgrades like Spark 2.2 to 2.3.. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent
[ https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434809#comment-16434809 ] Hyukjin Kwon commented on SPARK-23965: -- I think that sounds like we are going to make the third-party library even more dependent on Spark itself. Another simple solution I used a long while ago:
{code}
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
{code}
> make python/py4j-src-0.x.y.zip file name Spark version-independent > -- > > Key: SPARK-23965 > URL: https://issues.apache.org/jira/browse/SPARK-23965 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > After each Spark release (that's normally packaged with slightly newer > version of py4j), we have to adjust our PySpark applications PYTHONPATH to > point to correct version of python/py4j-src-0.9.2.zip. > Change to python/py4j-src-0.9.2.zip to python/py4j-src-0.9.6.zip, next > release to something else etc. > Possible solutions. Would be great to either > - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or > `python/py4j-src-current.zip` > - or make a symlink in Spark distributed `py4j-src-current.zip` to whatever > version Spark is shipped with. > In either case, if this would be solved, we wouldn't have to adjust > PYTHONPATH during upgrades like Spark 2.2 to 2.3.. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
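Expanded, the one-liner above does the following; a runnable POSIX-sh sketch (the fake `SPARK_HOME` layout and the py4j version number are invented here only so the example is self-contained):

```shell
# Build a colon-separated PYTHONPATH from every zip under $SPARK_HOME/python/lib,
# so no py4j version number is ever hard-coded.
SPARK_HOME=$(mktemp -d)                  # stand-in for a real Spark install
mkdir -p "$SPARK_HOME/python/lib"
touch "$SPARK_HOME/python/lib/py4j-src-0.10.6.zip" \
      "$SPARK_HOME/python/lib/pyspark.zip"

JOINED=$(printf '%s:' "$SPARK_HOME"/python/lib/*.zip)  # zip1:zip2:...:
JOINED=${JOINED%:}                                     # drop the trailing ':'
export PYTHONPATH="$JOINED${PYTHONPATH:+:$PYTHONPATH}"
echo "$PYTHONPATH"
```

Because the glob expands to whatever zips ship with the installed Spark, an upgrade from 2.2 to 2.3 needs no PYTHONPATH change.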
[jira] [Updated] (SPARK-23956) Use effective RPC port in AM registration
[ https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-23956: -- Priority: Minor (was: Major) > Use effective RPC port in AM registration > -- > > Key: SPARK-23956 > URL: https://issues.apache.org/jira/browse/SPARK-23956 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Gera Shegalov >Priority: Minor > > AM's should use their real rpc port in the AM registration for better > diagnostics in Application Report. > {code} > 18/04/10 14:56:21 INFO Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: localhost > ApplicationMaster RPC port: 58338 > queue: default > start time: 1523397373659 > final status: UNDEFINED > tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434799#comment-16434799 ] Hyukjin Kwon commented on SPARK-23961: -- FWIW, I hit this issue a while ago too (and gave up on using toLocalIterator at the time, so I never got around to debugging it). > pyspark toLocalIterator throws an exception > --- > > Key: SPARK-23961 > URL: https://issues.apache.org/jira/browse/SPARK-23961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Michel Lemay >Priority: Minor > Labels: DataFrame, pyspark > > Given a dataframe, use toLocalIterator. If we do not consume all records, > it will throw:
> {quote}ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset by peer: socket write error
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
> at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
> at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
> at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
> at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
> at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
> df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
> b = df.toLocalIterator()
> print(len(list(itertools.islice(b, 20))))
> b = None # Make the iterator go out of scope. Throws here.
> {quote}
>
> Observations:
> * Consuming all records does not throw. Taking only a subset of the partitions creates the error.
> * In another experiment, doing the same on a regular RDD works if we cache/materialize it. If we do not cache the RDD, it throws similarly.
> * It works in the scala shell
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
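The stack trace shows the JVM writer thread still pushing rows into the socket after the Python side has dropped its end. A pure-Python analogy of the pattern (no Spark involved): a producer abandoned mid-stream is closed from outside, and only survives here because it is written to expect that.

```python
import itertools

def serve_rows(n):
    """Stand-in for the JVM side streaming rows to Python over a socket."""
    try:
        for i in range(n):
            yield i
    except GeneratorExit:
        # The real PythonRDD writer has no such handler around its socket
        # writes, so an abandoned iterator surfaces as a SocketException
        # ("Connection reset by peer") in the executor logs instead.
        raise

rows = serve_rows(1_000_000)
first = list(itertools.islice(rows, 20))  # consume only part of the stream
rows.close()                              # iterator dropped early; producer told to stop
print(first[:5])  # -> [0, 1, 2, 3, 4]
```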
[jira] [Commented] (SPARK-23920) High-order function: array_remove(x, element) → array
[ https://issues.apache.org/jira/browse/SPARK-23920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434786#comment-16434786 ] Huaxin Gao commented on SPARK-23920: I will work on this. Thanks! > High-order function: array_remove(x, element) → array > - > > Key: SPARK-23920 > URL: https://issues.apache.org/jira/browse/SPARK-23920 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Remove all elements that equal element from array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
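As with the sibling array tasks, the documented Presto semantics fit in one line of plain Python; a reference sketch of the behavior only, not the Spark implementation:

```python
def array_remove(xs, element):
    """Remove all elements that equal `element` from the array x."""
    return [x for x in xs if x != element]

print(array_remove([1, 2, 2, 3], 2))  # -> [1, 3]
print(array_remove(["a", "b"], "c"))  # -> ['a', 'b']
```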
[jira] [Commented] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface
[ https://issues.apache.org/jira/browse/SPARK-23966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434682#comment-16434682 ] Apache Spark commented on SPARK-23966: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/21048 > Refactoring all checkpoint file writing logic in a common interface > --- > > Key: SPARK-23966 > URL: https://issues.apache.org/jira/browse/SPARK-23966 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > Checkpoint files (offset log files, state store files) in Structured > Streaming must be written atomically such that no partial files are generated > (would break fault-tolerance guarantees). Currently, there are 3 locations > which try to do this individually, and in some cases, incorrectly. > # HDFSOffsetMetadataLog - This uses a FileManager interface to use any > implementation of `FileSystem` or `FileContext` APIs. It preferably loads > `FileContext` implementation as FileContext of HDFS has atomic renames. > # HDFSBackedStateStore (aka in-memory state store) > ## Writing a version.delta file - This uses FileSystem APIs only to perform > a rename. This is incorrect as rename is not atomic in HDFS FileSystem > implementation. > ## Writing a snapshot file - Same as above. > Current problems: > # State Store behavior is incorrect - > # Inflexible - Some file systems provide mechanisms other than > write-to-temp-file-and-rename for writing atomically and more efficiently. > For example, with S3 you can write directly to the final file and it will be > made visible only when the entire file is written and closed correctly. Any > failure can be made to terminate the writing without making any partial files > visible in S3. The current code does not abstract out this mechanism enough > that it can be customized. 
> Solution: > # Introduce a common interface that all 3 cases above can use to write > checkpoint files atomically. > # This interface must provide the necessary interfaces that allow > customization of the write-and-rename mechanism. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
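The "common interface" idea can be sketched as an atomic-output contract: callers write, then either commit or abort, and write-to-temp-file-then-rename becomes just one strategy behind it. A Python sketch under assumed names (`RenameBasedAtomicFile` is illustrative, not the class the PR introduces):

```python
import os
import tempfile

class RenameBasedAtomicFile:
    """One strategy for the atomic-write contract: write to a temp file in the
    target directory, then rename into place on commit. An S3-backed strategy
    could instead upload on commit, since the object only becomes visible
    once the whole write completes."""

    def __init__(self, final_path):
        self.final_path = final_path
        fd, self.tmp_path = tempfile.mkstemp(
            dir=os.path.dirname(final_path) or ".", suffix=".tmp")
        self.f = os.fdopen(fd, "w")

    def write(self, data):
        self.f.write(data)

    def commit(self):
        self.f.close()
        os.replace(self.tmp_path, self.final_path)  # atomic on POSIX

    def abort(self):
        self.f.close()
        os.remove(self.tmp_path)  # no partial file ever becomes visible

# Usage sketch: a committed write appears; an aborted one leaves nothing behind.
checkpoint_dir = tempfile.mkdtemp()
ok = RenameBasedAtomicFile(os.path.join(checkpoint_dir, "1.delta"))
ok.write("state snapshot bytes...")
ok.commit()

bad = RenameBasedAtomicFile(os.path.join(checkpoint_dir, "2.delta"))
bad.write("partial...")
bad.abort()

print(sorted(os.listdir(checkpoint_dir)))  # -> ['1.delta']
```

Readers of the checkpoint directory never observe a half-written file, which is exactly the fault-tolerance guarantee the issue describes.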
[jira] [Assigned] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface
[ https://issues.apache.org/jira/browse/SPARK-23966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23966: Assignee: Tathagata Das (was: Apache Spark) > Refactoring all checkpoint file writing logic in a common interface > --- > > Key: SPARK-23966 > URL: https://issues.apache.org/jira/browse/SPARK-23966 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > Checkpoint files (offset log files, state store files) in Structured > Streaming must be written atomically such that no partial files are generated > (would break fault-tolerance guarantees). Currently, there are 3 locations > which try to do this individually, and in some cases, incorrectly. > # HDFSOffsetMetadataLog - This uses a FileManager interface to use any > implementation of `FileSystem` or `FileContext` APIs. It preferably loads > `FileContext` implementation as FileContext of HDFS has atomic renames. > # HDFSBackedStateStore (aka in-memory state store) > ## Writing a version.delta file - This uses FileSystem APIs only to perform > a rename. This is incorrect as rename is not atomic in HDFS FileSystem > implementation. > ## Writing a snapshot file - Same as above. > Current problems: > # State Store behavior is incorrect - > # Inflexible - Some file systems provide mechanisms other than > write-to-temp-file-and-rename for writing atomically and more efficiently. > For example, with S3 you can write directly to the final file and it will be > made visible only when the entire file is written and closed correctly. Any > failure can be made to terminate the writing without making any partial files > visible in S3. The current code does not abstract out this mechanism enough > that it can be customized. > Solution: > # Introduce a common interface that all 3 cases above can use to write > checkpoint files atomically. 
> # This interface must also provide the hooks necessary to allow > customization of the write-and-rename mechanism. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
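The write-to-temp-file-and-rename mechanism discussed in this issue can be sketched in a few lines of Python (an illustrative sketch of the general pattern only, not the interface Spark adopted; the helper name is invented):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write `data` to `path` via write-to-temp-file-and-rename.

    A crash before the final rename leaves only a temp file behind;
    readers never observe a partially written checkpoint file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX file systems
    except BaseException:
        os.unlink(tmp_path)  # abort: leave no partial file visible
        raise
```

Note that the atomicity of the final step is exactly what differs across file systems (atomic for HDFS `FileContext` and POSIX, not for the HDFS `FileSystem` rename or S3), which is why the issue asks for the mechanism to be pluggable rather than hard-coded.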
[jira] [Assigned] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface
[ https://issues.apache.org/jira/browse/SPARK-23966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23966: Assignee: Apache Spark (was: Tathagata Das) > Refactoring all checkpoint file writing logic in a common interface > --- > > Key: SPARK-23966 > URL: https://issues.apache.org/jira/browse/SPARK-23966 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Major > > Checkpoint files (offset log files, state store files) in Structured > Streaming must be written atomically, such that no partial files are generated > (partial files would break fault-tolerance guarantees). Currently, there are 3 locations > that try to do this individually, and in some cases, incorrectly. > # HDFSOffsetMetadataLog - This uses a FileManager interface that can work with either the > `FileSystem` or the `FileContext` API. It prefers the > `FileContext` implementation, because `FileContext` on HDFS provides atomic renames. > # HDFSBackedStateStore (aka in-memory state store) > ## Writing a version.delta file - This uses only the FileSystem API to perform > a rename. This is incorrect, as rename is not atomic in the HDFS FileSystem > implementation. > ## Writing a snapshot file - Same as above. > Current problems: > # State Store behavior is incorrect - renames through the FileSystem API are not atomic, so partially written files can become visible. > # Inflexible - Some file systems provide mechanisms other than > write-to-temp-file-and-rename for writing atomically and more efficiently. > For example, with S3 you can write directly to the final file, and it will be > made visible only when the entire file is written and closed correctly. Any > failure can be made to terminate the write without making any partial files > visible in S3. The current code does not abstract this mechanism out enough > for it to be customized. > Solution: > # Introduce a common interface that all 3 cases above can use to write > checkpoint files atomically. 
> # This interface must also provide the hooks necessary to allow > customization of the write-and-rename mechanism. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23966) Refactoring all checkpoint file writing logic in a common interface
Tathagata Das created SPARK-23966: - Summary: Refactoring all checkpoint file writing logic in a common interface Key: SPARK-23966 URL: https://issues.apache.org/jira/browse/SPARK-23966 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Tathagata Das Assignee: Tathagata Das Checkpoint files (offset log files, state store files) in Structured Streaming must be written atomically, such that no partial files are generated (partial files would break fault-tolerance guarantees). Currently, there are 3 locations that try to do this individually, and in some cases, incorrectly. # HDFSOffsetMetadataLog - This uses a FileManager interface that can work with either the `FileSystem` or the `FileContext` API. It prefers the `FileContext` implementation, because `FileContext` on HDFS provides atomic renames. # HDFSBackedStateStore (aka in-memory state store) ## Writing a version.delta file - This uses only the FileSystem API to perform a rename. This is incorrect, as rename is not atomic in the HDFS FileSystem implementation. ## Writing a snapshot file - Same as above. Current problems: # State Store behavior is incorrect - renames through the FileSystem API are not atomic, so partially written files can become visible. # Inflexible - Some file systems provide mechanisms other than write-to-temp-file-and-rename for writing atomically and more efficiently. For example, with S3 you can write directly to the final file, and it will be made visible only when the entire file is written and closed correctly. Any failure can be made to terminate the write without making any partial files visible in S3. The current code does not abstract this mechanism out enough for it to be customized. Solution: # Introduce a common interface that all 3 cases above can use to write checkpoint files atomically. # This interface must also provide the hooks necessary to allow customization of the write-and-rename mechanism. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23956) Use effective RPC port in AM registration
[ https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23956: Assignee: (was: Apache Spark) > Use effective RPC port in AM registration > -- > > Key: SPARK-23956 > URL: https://issues.apache.org/jira/browse/SPARK-23956 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Gera Shegalov >Priority: Major > > AMs should use their real RPC port in the AM registration for better > diagnostics in the Application Report. > {code} > 18/04/10 14:56:21 INFO Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: localhost > ApplicationMaster RPC port: 58338 > queue: default > start time: 1523397373659 > final status: UNDEFINED > tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23956) Use effective RPC port in AM registration
[ https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434595#comment-16434595 ] Apache Spark commented on SPARK-23956: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/21047 > Use effective RPC port in AM registration > -- > > Key: SPARK-23956 > URL: https://issues.apache.org/jira/browse/SPARK-23956 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Gera Shegalov >Priority: Major > > AMs should use their real RPC port in the AM registration for better > diagnostics in the Application Report. > {code} > 18/04/10 14:56:21 INFO Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: localhost > ApplicationMaster RPC port: 58338 > queue: default > start time: 1523397373659 > final status: UNDEFINED > tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23956) Use effective RPC port in AM registration
[ https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23956: Assignee: Apache Spark > Use effective RPC port in AM registration > -- > > Key: SPARK-23956 > URL: https://issues.apache.org/jira/browse/SPARK-23956 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Gera Shegalov >Assignee: Apache Spark >Priority: Major > > AMs should use their real RPC port in the AM registration for better > diagnostics in the Application Report. > {code} > 18/04/10 14:56:21 INFO Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: localhost > ApplicationMaster RPC port: 58338 > queue: default > start time: 1523397373659 > final status: UNDEFINED > tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent
[ https://issues.apache.org/jira/browse/SPARK-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-23965: -- Description: After each Spark release (that's normally packaged with slightly newer version of py4j), we have to adjust our PySpark applications PYTHONPATH to point to correct version of python/py4j-src-0.9.2.zip. Change to python/py4j-src-0.9.2.zip to python/py4j-src-0.9.6.zip, next release to something else etc. Possible solutions. Would be great to either - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or `python/py4j-src-current.zip` - or make a symlink in Spark distributed `py4j-src-current.zip` to whatever version Spark is shipped with. In either case, if this would be solved, we wouldn't have to adjust PYTHONPATH during upgrades like Spark 2.2 to 2.3.. Thanks. was: After each Spark release (that's normally packaged with slightly newer version of py4j), we have to adjust our PySpark applications PYTHONPATH to point to correct version of python/py4j-src-0.9.2.zip. Change to python/py4j-src-0.9.2.zip to python/py4j-src-0.9.6.zip, next release to something else etc. Possible solutions. Would be great to either - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or `python/py4j-src-current.zip` - make a symlink in Spark distributed `py4j-src-current.zip` to whatever version Spark is shipped with. Thanks. > make python/py4j-src-0.x.y.zip file name Spark version-independent > -- > > Key: SPARK-23965 > URL: https://issues.apache.org/jira/browse/SPARK-23965 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > After each Spark release (that's normally packaged with slightly newer > version of py4j), we have to adjust our PySpark applications PYTHONPATH to > point to correct version of python/py4j-src-0.9.2.zip. 
> In one release it changes from python/py4j-src-0.9.2.zip to python/py4j-src-0.9.6.zip, in the next > release to something else, etc. > Possible solutions: it would be great to either > - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or > `python/py4j-src-current.zip`, > - or make a symlink `py4j-src-current.zip` in the Spark distribution pointing to whatever > version Spark is shipped with. > In either case, if this were solved, we wouldn't have to adjust > PYTHONPATH during upgrades like Spark 2.2 to 2.3. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23955) typo in parameter name 'rawPredicition'
[ https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434555#comment-16434555 ] Apache Spark commented on SPARK-23955: -- User 'codeforfun15' has created a pull request for this issue: https://github.com/apache/spark/pull/21046 > typo in parameter name 'rawPredicition' > --- > > Key: SPARK-23955 > URL: https://issues.apache.org/jira/browse/SPARK-23955 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: John Bauer >Priority: Trivial > > The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo > rawPredicition instead of rawPrediction; > the typo is also present in the doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23955) typo in parameter name 'rawPredicition'
[ https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23955: Assignee: Apache Spark > typo in parameter name 'rawPredicition' > --- > > Key: SPARK-23955 > URL: https://issues.apache.org/jira/browse/SPARK-23955 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: John Bauer >Assignee: Apache Spark >Priority: Trivial > > The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo > rawPredicition instead of rawPrediction; > the typo is also present in the doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23955) typo in parameter name 'rawPredicition'
[ https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23955: Assignee: (was: Apache Spark) > typo in parameter name 'rawPredicition' > --- > > Key: SPARK-23955 > URL: https://issues.apache.org/jira/browse/SPARK-23955 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: John Bauer >Priority: Trivial > > The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo > rawPredicition instead of rawPrediction; > the typo is also present in the doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23965) make python/py4j-src-0.x.y.zip file name Spark version-independent
Ruslan Dautkhanov created SPARK-23965: - Summary: make python/py4j-src-0.x.y.zip file name Spark version-independent Key: SPARK-23965 URL: https://issues.apache.org/jira/browse/SPARK-23965 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0, 2.2.1, 2.4.0 Reporter: Ruslan Dautkhanov After each Spark release (which is normally packaged with a slightly newer version of py4j), we have to adjust our PySpark applications' PYTHONPATH to point to the correct version of python/py4j-src-0.9.2.zip. In one release it changes from python/py4j-src-0.9.2.zip to python/py4j-src-0.9.6.zip, in the next release to something else, etc. Possible solutions: it would be great to either - rename `python/py4j-src-0.x.y.zip` to `python/py4j-src-latest.zip` or `python/py4j-src-current.zip`, - or make a symlink `py4j-src-current.zip` in the Spark distribution pointing to whatever version Spark is shipped with. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431403#comment-16431403 ] Bruce Robbins edited comment on SPARK-23715 at 4/11/18 8:09 PM: I've been convinced this is worth fixing, at least for String input values, since a user was actually seeing wrong results despite specifying a datetime value with a UTC timezone. One way to fix this is to create a new expression type for converting string values to timestamp values. The Analyzer would place this expression as a left child of FromUTCTimestamp, if needed. This new expression type would be more aware of FromUTCTimestamp's expectations than a general purpose Cast expression (for example, it could reject string datetime values that contain an explicit timezone). Any opinions? [~cloud_fan] [~smilegator]? was (Author: bersprockets): I've been convinced this is worth fixing, at least for String input values, since a user was actually seeing wrong results despite specifying a datetime value with a UTC timezone. One way to fix this is to create a new expression type for converting string values to timestamp values. The Analyzer would place this expression as a left child of FromUTCTimestamp, if needed. This new expression type would be more aware of FromUTCTimestamp's expectations than a general purpose Cast expression (for example, it could reject string datetime values that contain an explicit timezone). Any opinions? 
> from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Major > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. 
> When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from
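The offset arithmetic described above can be reproduced with a small Python model (a deliberately simplified sketch of the described behavior, not Spark's implementation; both helper names are invented, and the session time zone is fixed at the GMT-7 Los Angeles offset from the example):

```python
from datetime import datetime, timedelta, timezone

LOCAL_OFFSET = timedelta(hours=-7)  # session time zone: Los Angeles (PDT)

def cast_to_epoch(dt_naive, explicit_offset=None):
    """The implicit string->timestamp cast: an explicit offset in the
    input wins; otherwise the session-local time zone is assumed."""
    tz = timezone(explicit_offset if explicit_offset is not None else LOCAL_OFFSET)
    return dt_naive.replace(tzinfo=tz).timestamp()

def from_utc_timestamp(epoch, target_offset_hours):
    """FromUTCTimestamp assumes `epoch` came from a local-time-zone parse,
    so it unconditionally adds the local offset back before shifting to the
    target zone -- wrong whenever the input carried its own offset."""
    as_utc = epoch + LOCAL_OFFSET.total_seconds()  # undo the assumed local parse
    shifted = as_utc + target_offset_hours * 3600  # render in GMT+target
    return datetime.fromtimestamp(shifted, tz=timezone.utc).strftime("%H:%M:%S")
```

With no explicit offset, the cast's local-zone parse and the expression's correction cancel out and 06:18:23 in GMT+1 comes back as 07:18:23; with an explicit +00:00 offset the correction is applied anyway, reproducing the 00:18:23 wrong answer shown above.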
[jira] [Commented] (SPARK-23914) High-order function: array_union(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434494#comment-16434494 ] Kazuaki Ishizaki commented on SPARK-23914: -- I will work on this, thank you. > High-order function: array_union(x, y) → array > -- > > Key: SPARK-23914 > URL: https://issues.apache.org/jira/browse/SPARK-23914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the union of x and y, without duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
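The referenced Presto semantics can be sketched in Python (first-seen ordering is an assumption of this sketch; the Presto description only requires the union without duplicates):

```python
def array_union(x, y):
    """Array of the elements in the union of x and y, without duplicates."""
    # dict.fromkeys deduplicates while keeping first-seen order
    return list(dict.fromkeys(list(x) + list(y)))
```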
[jira] [Commented] (SPARK-23913) High-order function: array_intersect(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434491#comment-16434491 ] Kazuaki Ishizaki commented on SPARK-23913: -- I will work on this, thank you. > High-order function: array_intersect(x, y) → array > -- > > Key: SPARK-23913 > URL: https://issues.apache.org/jira/browse/SPARK-23913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the intersection of x and y, without > duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
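As with array_union above, the intersection semantics can be sketched in Python (ordering by first appearance in x is an assumption of this sketch):

```python
def array_intersect(x, y):
    """Array of the elements in the intersection of x and y, without duplicates."""
    in_y = set(y)  # O(1) membership tests against y
    return list(dict.fromkeys(v for v in x if v in in_y))
```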
[jira] [Commented] (SPARK-23964) why does Spillable wait for 32 elements?
[ https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434479#comment-16434479 ] Thomas Graves commented on SPARK-23964: --- I'm not sure; I'm trying to figure out whether there are performance implications here, and perhaps there are, but at the cost of not being accurate on memory usage. In deployments with fixed-size containers this is very important. If you wait for 32 elements, it may cause you to acquire a bigger chunk of memory at once vs. getting smaller (thus more) allocations. I would think the only check you need is currentMemory >= myMemoryThreshold; the initial threshold is 5MB right now, but all it's doing is asking for more memory, and only when it can't get memory does it spill. And the initial threshold is configurable, so you can always make it bigger. I'm going to try to do some performance tests to see what happens, but would like to know if anyone has other background. > why does Spillable wait for 32 elements? > > > Key: SPARK-23964 > URL: https://issues.apache.org/jira/browse/SPARK-23964 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Thomas Graves >Priority: Major > > The spillable class has a check in maybeSpill as to when it tries to acquire > more memory and determines if it should spill: > if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { > Before it looks to see if it should spill. > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] > I'm wondering why it has the elementsRead % 32 in it? If I have a small > number of elements that are huge, this can easily cause OOM before we actually > spill. > I saw a few conversations on this and one related JIRA: > https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an > answer to this. > anyone have history on this? 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
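The gating being discussed can be modeled in a few lines of Python (a simplified sketch of the logic at the linked line, not Spark's Scala class; the memory-manager callback `acquire` and the class shape are invented for illustration):

```python
INITIAL_THRESHOLD = 5 * 1024 * 1024  # the 5MB initial threshold mentioned above

class Spillable:
    def __init__(self):
        self.elements_read = 0
        self.my_memory_threshold = INITIAL_THRESHOLD
        self.spill_count = 0

    def maybe_spill(self, current_memory, acquire):
        """`acquire(n)` asks the memory manager for n more bytes and
        returns how many were actually granted."""
        should_spill = False
        # The % 32 gate: memory is only re-examined every 32 elements, so up
        # to 31 huge records can arrive between checks without triggering a
        # spill -- the OOM risk raised in this issue.
        if self.elements_read % 32 == 0 and current_memory >= self.my_memory_threshold:
            granted = acquire(2 * current_memory - self.my_memory_threshold)
            self.my_memory_threshold += granted
            # spill only if the memory manager could not grant enough
            should_spill = current_memory >= self.my_memory_threshold
        if should_spill:
            self.spill_count += 1
            self.my_memory_threshold = INITIAL_THRESHOLD  # reset after spilling
        self.elements_read += 1
        return should_spill
```

Running this model with a memory manager that grants nothing shows the behavior under discussion: a spill can only ever happen on every 32nd element, regardless of how much memory accumulates in between.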
[jira] [Assigned] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23931: Assignee: Apache Spark > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
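The NULL-fill semantics quoted above map directly onto Python's `itertools.zip_longest` (a sketch of the Presto semantics, not a Spark implementation; `None` stands in for SQL NULL):

```python
from itertools import zip_longest

def zip_arrays(*arrays):
    """Merge the given arrays element-wise into rows, filling missing
    values with NULL (None) when the arguments have uneven lengths."""
    return [tuple(row) for row in zip_longest(*arrays, fillvalue=None)]
```

Applied to the SELECT example in the description, `zip_arrays([1, 2], ['1b', None, '3b'])` yields rows `(1, '1b')`, `(2, None)`, `(None, '3b')`.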
[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434475#comment-16434475 ] Apache Spark commented on SPARK-23931: -- User 'DylanGuedes' has created a pull request for this issue: https://github.com/apache/spark/pull/21045 > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23931: Assignee: (was: Apache Spark) > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23964) why does Spillable wait for 32 elements?
[ https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434460#comment-16434460 ] Reynold Xin commented on SPARK-23964: - Was it trying to reduce overhead? > why does Spillable wait for 32 elements? > > > Key: SPARK-23964 > URL: https://issues.apache.org/jira/browse/SPARK-23964 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Thomas Graves >Priority: Major > > The spillable class has a check in maybeSpill as to when it tries to acquire > more memory and determine if it should spill: > if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { > Before it looks to see if it should spill. > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] > I'm wondering why it has the elementsRead %32 in it? If I have a small > number of elements that are huge this can easily cause OOM before we actually > spill. > I saw a few conversations on this and one Jira related: > https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an > answer to this. > anyone have history on this? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23964) why does Spillable wait for 32 elements?
[ https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-23964: -- Description: The spillable class has a check in maybeSpill as to when it tries to acquire more memory and determine if it should spill: if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { Before it looks to see if it should spill. [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] I'm wondering why it has the elementsRead %32 in it? If I have a small number of elements that are huge this can easily cause OOM before we actually spill. I saw a few conversations on this and one Jira related: https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an answer to this. anyone have history on this? was: The spillable class has a check: if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { Before it looks to see if it should spill. [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] I'm wondering why it has the elementsRead %32 in it? If I have a small number of elements that are huge this can easily cause OOM before we actually spill. I saw a few conversations on this and one Jira related: https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an answer to this. anyone have history on this? > why does Spillable wait for 32 elements? > > > Key: SPARK-23964 > URL: https://issues.apache.org/jira/browse/SPARK-23964 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Thomas Graves >Priority: Major > > The spillable class has a check in maybeSpill as to when it tries to acquire > more memory and determine if it should spill: > if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { > Before it looks to see if it should spill. 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] > I'm wondering why it has the elementsRead %32 in it? If I have a small > number of elements that are huge this can easily cause OOM before we actually > spill. > I saw a few conversations on this and one Jira related: > https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an > answer to this. > anyone have history on this? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23964) why does Spillable wait for 32 elements?
[ https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434454#comment-16434454 ] Thomas Graves commented on SPARK-23964: --- [~andrewor14] [~matei] [~r...@databricks.com] A few related threads: [https://github.com/apache/spark/pull/3302] [https://github.com/apache/spark/pull/3656] https://github.com/apache/spark/commit/3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099 > why does Spillable wait for 32 elements? > > > Key: SPARK-23964 > URL: https://issues.apache.org/jira/browse/SPARK-23964 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Thomas Graves >Priority: Major > > The spillable class has a check: > if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { > Before it looks to see if it should spill. > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] > I'm wondering why it has the elementsRead %32 in it? If I have a small > number of elements that are huge this can easily cause OOM before we actually > spill. > I saw a few conversations on this and one Jira related: > https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an > answer to this. > anyone have history on this? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9312) The OneVsRest model does not provide confidence factor (not probability) along with the prediction
[ https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434447#comment-16434447 ] Apache Spark commented on SPARK-9312: - User 'ludatabricks' has created a pull request for this issue: https://github.com/apache/spark/pull/21044 > The OneVsRest model does not provide confidence factor(not probability) along > with the prediction > - > > Key: SPARK-9312 > URL: https://issues.apache.org/jira/browse/SPARK-9312 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0, 1.4.1 >Reporter: Badari Madhav >Priority: Major > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23964) why does Spillable wait for 32 elements?
[ https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-23964: -- Description: The spillable class has a check: if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { Before it looks to see if it should spill. [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] I'm wondering why it has the elementsRead %32 in it? If I have a small number of elements that are huge this can easily cause OOM before we actually spill. I saw a few conversations on this and one Jira related: https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an answer to this. anyone have history on this? > why does Spillable wait for 32 elements? > > > Key: SPARK-23964 > URL: https://issues.apache.org/jira/browse/SPARK-23964 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Thomas Graves >Priority: Major > > The spillable class has a check: > if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { > Before it looks to see if it should spill. > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] > I'm wondering why it has the elementsRead %32 in it? If I have a small > number of elements that are huge this can easily cause OOM before we actually > spill. > I saw a few conversations on this and one Jira related: > https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an > answer to this. > anyone have history on this? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23964) why does Spillable wait for 32 elements?
[ https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-23964: -- Environment: (was: The spillable class has a check: if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { Before it looks to see if it should spill. [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] I'm wondering why it has the elementsRead %32 in it? If I have a small number of elements that are huge this can easily cause OOM before we actually spill. I saw a few conversations on this and one Jira related: https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an answer to this. anyone have history on this?) > why does Spillable wait for 32 elements? > > > Key: SPARK-23964 > URL: https://issues.apache.org/jira/browse/SPARK-23964 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23964) why does Spillable wait for 32 elements?
Thomas Graves created SPARK-23964: - Summary: why does Spillable wait for 32 elements? Key: SPARK-23964 URL: https://issues.apache.org/jira/browse/SPARK-23964 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Environment: The spillable class has a check: if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { Before it looks to see if it should spill. [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83] I'm wondering why it has the elementsRead %32 in it? If I have a small number of elements that are huge this can easily cause OOM before we actually spill. I saw a few conversations on this and one Jira related: https://issues.apache.org/jira/browse/SPARK-4456 . but I've never seen an answer to this. anyone have history on this? Reporter: Thomas Graves -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22883. --- Resolution: Fixed Fix Version/s: 2.3.1 Issue resolved by pull request 21042 [https://github.com/apache/spark/pull/21042] > ML test for StructuredStreaming: spark.ml.feature, A-M > -- > > Key: SPARK-22883 > URL: https://issues.apache.org/jira/browse/SPARK-22883 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > *For featurizers with names from A - M* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-23961: - Description: Given a dataframe and use toLocalIterator. If we do not consume all records, it will throw: {quote}ERROR PythonRDD: Error while sending iterator java.net.SocketException: Connection reset by peer: socket write error at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) at java.net.SocketOutputStream.write(SocketOutputStream.java:155) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97) at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) {quote} To reproduce, here is a simple pyspark shell script that show the error: {quote}import itertools df = spark.read.parquet("large parquet folder").cache() print(df.count()) b = df.toLocalIterator() print(len(list(itertools.islice(b, 20 b = None # Make the iterator 
goes out of scope. Throws here. {quote} Observations: * Consuming all records do not throw. Taking only a subset of the partitions create the error. * In another experiment, doing the same on a regular RDD works if we cache/materialize it. If we do not cache the RDD, it throws similarly. * It works in scala shell was: Given a dataframe, take it's rdd and use toLocalIterator. If we do not consume all records, it will throw: {quote}ERROR PythonRDD: Error while sending iterator java.net.SocketException: Connection reset by peer: socket write error at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) at java.net.SocketOutputStream.write(SocketOutputStream.java:155) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97) at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) {quote} To reproduce, here is a simple pyspark shell script that show the error: {quote}import itertools df 
= spark.read.parquet("large parquet folder") cachedRDD = df.rdd.cache() print(cachedRDD.count()) # materialize b = cachedRDD.toLocalIterator() print(len(list(itertools.islice(b, 20 b = None # Make the iterator goes out of scope. Throws here. {quote} Observations: * Consuming all records do not throw. Taking only a subset of the partitions create the error. * In another experiment, doing the same on a regular RDD works if we cache/materialize it. If we do not cache the RDD, it throws similarly. * It works in scala shell > pyspark toLocalIterator throws an exception > --- > > Key: SPARK-23961 > URL: https://issues.apache.org/jira/browse/SPARK-23961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >
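The reproduction pattern above can be shown without a Spark cluster. This is a plain-Python analogue of the report's shape, assuming a generator stands in for the driver-side iterator that executors feed over a socket; in real PySpark, abandoning the iterator mid-stream is where the "Connection reset by peer" appears, because the JVM side is still writing the remaining records.

```python
import itertools

def records():
    # Stand-in for df.toLocalIterator(): each yield corresponds to one
    # record streamed from the JVM to the Python driver.
    for i in range(1000):
        yield i

it = records()
# Consume only a prefix, exactly as the report's islice(b, 20) does.
first_20 = list(itertools.islice(it, 20))
print(len(first_20))  # 20
it = None  # Dropping the partially consumed iterator is the step that
           # triggers the SocketException in the real PySpark scenario.
```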
[jira] [Assigned] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23963: Assignee: (was: Apache Spark) > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > TableReader gets disproportionately slower as the number of columns in the > query increase. > For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434265#comment-16434265 ] Apache Spark commented on SPARK-23963: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/21043 > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > TableReader gets disproportionately slower as the number of columns in the > query increase. > For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23963: Assignee: Apache Spark > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > TableReader gets disproportionately slower as the number of columns in the > query increase. > For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
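The List-versus-Array cost described above is easy to demonstrate. The sketch below is illustrative Python, not TableReader's code: a minimal singly linked list plays the role of Scala's `List` (reaching column i requires walking i links, so per-record field access is O(n)), and materializing it once into an array gives the O(1) lookups the patch switches to.

```python
class Node:
    """One cell of a singly linked list, analogous to a Scala List cons cell."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def nth(head, i):
    """O(i) positional lookup -- the per-column cost the issue describes."""
    node = head
    for _ in range(i):
        node = node.next
    return node.value

# Build a 5-"column" linked list holding 0..4.
head = None
for v in reversed(range(5)):
    head = Node(v, head)

# The fix's pattern: traverse once up front into an array, then index in O(1).
as_array = []
node = head
while node is not None:
    as_array.append(node.value)
    node = node.next

print(nth(head, 3), as_array[3])  # 3 3
```

With n columns looked up per record, the linked structure costs O(n^2) per record while the array costs O(n), matching the disproportionate slowdown reported.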
[jira] [Updated] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks
[ https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23948: - Component/s: Scheduler > Trigger mapstage's job listener in submitMissingTasks > - > > Key: SPARK-23948 > URL: https://issues.apache.org/jira/browse/SPARK-23948 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: jin xing >Priority: Major > > SparkContext submitted a map stage from "submitMapStage" to DAGScheduler, > "markMapStageJobAsFinished" is called only in > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 > and > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); > But think about below scenario: > 1. stage0 and stage1 are all "ShuffleMapStage" and stage1 depends on stage0; > 2. We submit stage1 by "submitMapStage"; > 3. When stage 1 running, "FetchFailed" happened, stage0 and stage1 got > resubmitted as stage0_1 and stage1_1; > 4. When stage0_1 running, speculated tasks in old stage1 come as succeeded, > but stage1 is not inside "runningStages". So even though all splits(including > the speculated tasks) in stage1 succeeded, job listener in stage1 will not be > called; > 5. stage0_1 finished, stage1_1 starts running. When "submitMissingTasks", > there is no missing tasks. But in current code, job listener is not triggered -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
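The fix described in the scenario above, namely firing the map stage's job listener even when submitMissingTasks finds nothing to run, can be sketched as follows. This is a hypothetical illustration; the function shape, the stage/listener representations, and all names are invented for the example and are not Spark's DAGScheduler API.

```python
def submit_missing_tasks(stage, listeners):
    """Launch missing partitions, or, if none remain (e.g. speculated task
    copies from the old stage attempt already finished them), notify the
    stage's job listeners instead of silently returning -- otherwise a
    caller of submitMapStage would wait forever, as step 5 describes."""
    missing = [p for p in stage["partitions"] if not p["done"]]
    if missing:
        return missing  # normal path: these partitions get tasks launched
    for callback in listeners.get(stage["id"], []):
        callback(stage["id"])  # the markMapStageJobAsFinished analogue
    return []

fired = []
stage = {"id": 1, "partitions": [{"done": True}, {"done": True}]}
submit_missing_tasks(stage, {1: [fired.append]})
print(fired)  # [1]
```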
[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22883: -- Fix Version/s: 2.4.0 > ML test for StructuredStreaming: spark.ml.feature, A-M > -- > > Key: SPARK-22883 > URL: https://issues.apache.org/jira/browse/SPARK-22883 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > Fix For: 2.4.0 > > > *For featurizers with names from A - M* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22883: -- Target Version/s: 2.3.1, 2.4.0 > ML test for StructuredStreaming: spark.ml.feature, A-M > -- > > Key: SPARK-22883 > URL: https://issues.apache.org/jira/browse/SPARK-22883 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > Fix For: 2.4.0 > > > *For featurizers with names from A - M* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434238#comment-16434238 ] Apache Spark commented on SPARK-22883: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/21042 > ML test for StructuredStreaming: spark.ml.feature, A-M > -- > > Key: SPARK-22883 > URL: https://issues.apache.org/jira/browse/SPARK-22883 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > *For featurizers with names from A - M* > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23734) InvalidSchemaException While Saving ALSModel
[ https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434219#comment-16434219 ] Stanley Poon edited comment on SPARK-23734 at 4/11/18 4:58 PM: --- [~viirya] Thank you for checking into this. I added the Spark release details where this is reproducible - v2.3.0-rc5 released on Feb 22, 2018. And will verify that it is fixed in the next release. was (Author: spoon): [~viirya] Thank you for checking into this. I added the Spark release details where this is reproducible. And will verify that it is fixed in the next release. > InvalidSchemaException While Saving ALSModel > > > Key: SPARK-23734 > URL: https://issues.apache.org/jira/browse/SPARK-23734 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: macOS 10.13.2 > Scala 2.11.8 > Spark 2.3.0 v2.3.0-rc5 >Reporter: Stanley Poon >Priority: Major > Labels: ALS, parquet, persistence > > After fitting an ALSModel, get following error while saving the model: > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > Exactly the same code ran ok on 2.2.1. > Same issue also occurs on other ALSModels we have. > h2. *To reproduce* > Get ALSExample: > [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala] > and add the following line to save the model right before "spark.stop". > {quote} model.write.overwrite().save("SparkExampleALSModel") > {quote} > h2. 
Stack Trace > Exception in thread "main" java.lang.ExceptionInInitializerError > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225) > at > org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103) > at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83) > at com.vitalmove.model.ALSExample.main(ALSExample.scala) > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > at org.apache.parquet.schema.GroupType.(GroupType.java:92) > at org.apache.parquet.schema.GroupType.(GroupType.java:48) > at org.apache.parqu
[jira] [Updated] (SPARK-23734) InvalidSchemaException While Saving ALSModel
[ https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanley Poon updated SPARK-23734: - Environment: macOS 10.13.2 Scala 2.11.8 Spark 2.3.0 v2.3.0-rc5 (Feb 22 2018) was: macOS 10.13.2 Scala 2.11.8 Spark 2.3.0 v2.3.0-rc5 > InvalidSchemaException While Saving ALSModel > > > Key: SPARK-23734 > URL: https://issues.apache.org/jira/browse/SPARK-23734 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: macOS 10.13.2 > Scala 2.11.8 > Spark 2.3.0 v2.3.0-rc5 (Feb 22 2018) >Reporter: Stanley Poon >Priority: Major > Labels: ALS, parquet, persistence > > After fitting an ALSModel, get following error while saving the model: > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > Exactly the same code ran ok on 2.2.1. > Same issue also occurs on other ALSModels we have. > h2. *To reproduce* > Get ALSExample: > [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala] > and add the following line to save the model right before "spark.stop". > {quote} model.write.overwrite().save("SparkExampleALSModel") > {quote} > h2. 
Stack Trace > Exception in thread "main" java.lang.ExceptionInInitializerError > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225) > at > org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103) > at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83) > at com.vitalmove.model.ALSExample.main(ALSExample.scala) > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > at org.apache.parquet.schema.GroupType.(GroupType.java:92) > at org.apache.parquet.schema.GroupType.(GroupType.java:48) > at org.apache.parquet.schema.MessageType.(MessageType.java:50) > at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.(ParquetSchemaConverter.scala:567) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.(Parque
[jira] [Commented] (SPARK-23734) InvalidSchemaException While Saving ALSModel
[ https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434219#comment-16434219 ] Stanley Poon commented on SPARK-23734: -- [~viirya] Thank you for checking into this. I added the Spark release details where this is reproducible. And will verify that it is fixed in the next release. > InvalidSchemaException While Saving ALSModel > > > Key: SPARK-23734 > URL: https://issues.apache.org/jira/browse/SPARK-23734 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: macOS 10.13.2 > Scala 2.11.8 > Spark 2.3.0 v2.3.0-rc5 >Reporter: Stanley Poon >Priority: Major > Labels: ALS, parquet, persistence > > After fitting an ALSModel, get following error while saving the model: > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > Exactly the same code ran ok on 2.2.1. > Same issue also occurs on other ALSModels we have. > h2. *To reproduce* > Get ALSExample: > [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala] > and add the following line to save the model right before "spark.stop". > {quote} model.write.overwrite().save("SparkExampleALSModel") > {quote} > h2. 
Stack Trace > Exception in thread "main" java.lang.ExceptionInInitializerError > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225) > at > org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103) > at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83) > at com.vitalmove.model.ALSExample.main(ALSExample.scala) > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can > not be empty. Parquet does not support empty group without leaves. Empty > group: spark_schema > at org.apache.parquet.schema.GroupType.(GroupType.java:92) > at org.apache.parquet.schema.GroupType.(GroupType.java:48) > at org.apache.parquet.schema.MessageType.(MessageType.java:50) > at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.(ParquetSchemaConverter.scala:567) > at > org.apache.spark.sql.ex
[jira] [Updated] (SPARK-23734) InvalidSchemaException While Saving ALSModel
[ https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanley Poon updated SPARK-23734: - Environment: macOS 10.13.2 Scala 2.11.8 Spark 2.3.0 v2.3.0-rc5 was: macOS 10.13.2 Scala 2.11.8 Spark 2.3.0
[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1, map2, ..., mapN) → map
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434185#comment-16434185 ] Marek Novotny commented on SPARK-23936: --- Shouldn't we overload _concat_ function for maps instead of introducing _map_concat_? > High-order function: map_concat(map1, map2, ..., mapN) → > map > --- > > Key: SPARK-23936 > URL: https://issues.apache.org/jira/browse/SPARK-23936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns the union of all the given maps. If a key is found in multiple given > maps, that key’s value in the resulting map comes from the last one of those > maps. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
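The Presto semantics quoted above (the value for a duplicate key comes from the last map) can be sketched on ordinary Scala maps. This is illustrative only, not Spark's eventual implementation, and `mapConcat` is a made-up helper name:

```scala
// Illustrative sketch of map_concat semantics on plain Scala maps.
// Map's ++ lets the right-hand operand win on key collisions, which
// matches the "last map wins" rule quoted from the Presto docs.
def mapConcat[K, V](maps: Map[K, V]*): Map[K, V] =
  maps.foldLeft(Map.empty[K, V])(_ ++ _)

val merged = mapConcat(Map(1 -> "a", 2 -> "b"), Map(2 -> "c", 3 -> "d"))
// merged == Map(1 -> "a", 2 -> "c", 3 -> "d"): key 2 takes the later value
```

Whether this ends up exposed as map_concat or as an overload of concat, as the comment suggests, does not change these merge semantics.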
[jira] [Updated] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23963: -- Description: TableReader gets disproportionately slower as the number of columns in the query increase. For example, reading a table with 6000 columns is 4 times more expensive per record than reading a table with 3000 columns, rather than twice as expensive. The increase in processing time is due to several Lists (fieldRefs, fieldOrdinals, and unwrappers), each of which the reader accesses by column number for each column in a record. Because each List has O(n) time for lookup by column number, these lookups grow increasingly expensive as the column count increases. When I patched the code to change those 3 Lists to Arrays, the query times became proportional. was: TableReader gets disproportionately slower as the number of columns in the query increase. For example, reading a table with 6000 columns is 4 times more expensive per record than reading a table with 3000 columns, rather than twice as expensive. The increase in processing time is due to several Lists (fieldRefs, fieldOrdinals, and unwrappers), each of which the reader accesses by column number for each column in a record. Because each List has O(n) time for lookup by column number, these lookups grow increasingly expensive as the column count increases. When I patched the code to change those 3 Lists to Arrays, the query times became proportional. > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > TableReader gets disproportionately slower as the number of columns in the > query increase. 
> For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O(n) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23963: -- Description: TableReader gets disproportionately slower as the number of columns in the query increase. For example, reading a table with 6000 columns is 4 times more expensive per record than reading a table with 3000 columns, rather than twice as expensive. The increase in processing time is due to several Lists (fieldRefs, fieldOrdinals, and unwrappers), each of which the reader accesses by column number for each column in a record. Because each List has O\(n\) time for lookup by column number, these lookups grow increasingly expensive as the column count increases. When I patched the code to change those 3 Lists to Arrays, the query times became proportional. was: TableReader gets disproportionately slower as the number of columns in the query increase. For example, reading a table with 6000 columns is 4 times more expensive per record than reading a table with 3000 columns, rather than twice as expensive. The increase in processing time is due to several Lists (fieldRefs, fieldOrdinals, and unwrappers), each of which the reader accesses by column number for each column in a record. Because each List has O(n) time for lookup by column number, these lookups grow increasingly expensive as the column count increases. When I patched the code to change those 3 Lists to Arrays, the query times became proportional. > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > TableReader gets disproportionately slower as the number of columns in the > query increase. 
> For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
Bruce Robbins created SPARK-23963: - Summary: Queries on text-based Hive tables grow disproportionately slower as the number of columns increase Key: SPARK-23963 URL: https://issues.apache.org/jira/browse/SPARK-23963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Bruce Robbins TableReader gets disproportionately slower as the number of columns in the query increase. For example, reading a table with 6000 columns is 4 times more expensive per record than reading a table with 3000 columns, rather than twice as expensive. The increase in processing time is due to several Lists (fieldRefs, fieldOrdinals, and unwrappers), each of which the reader accesses by column number for each column in a record. Because each List has O(n) time for lookup by column number, these lookups grow increasingly expensive as the column count increases. When I patched the code to change those 3 Lists to Arrays, the query times became proportional. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
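The cost difference described above is inherent to the collection types, not to Spark. A minimal sketch (with made-up data, not the actual `fieldRefs`/`fieldOrdinals`/`unwrappers` contents) shows the shape of the fix: build the lookup structures once, convert them to arrays, and index into the arrays per record:

```scala
// Illustrative only: Scala's List has O(n) positional access, while Array
// is O(1). TableReader indexed three list-like structures once per column
// per record, which is why wide rows got disproportionately slow.
val columnCount = 6000
val asList: List[Int]   = (0 until columnCount).toList
val asArray: Array[Int] = asList.toArray // one-time conversion, as in the patch

// Per-record access pattern: same results, constant cost per lookup on the array.
val viaList  = (0 until columnCount).map(asList(_)).sum
val viaArray = (0 until columnCount).map(asArray(_)).sum
```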
[jira] [Commented] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds
[ https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434077#comment-16434077 ] Apache Spark commented on SPARK-23962: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/21041 > Flaky tests from SQLMetricsTestUtils.currentExecutionIds > > > Key: SPARK-23962 > URL: https://issues.apache.org/jira/browse/SPARK-23962 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Minor > Attachments: unit-tests.log > > > I've seen > {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin > metrics}} fail > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/ > > with > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: 2 did not equal 1 > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did > not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188) > at > 
org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260) > ... > {noformat} > I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is > racing with the listener bus. > I'll attach trimmed logs as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds
[ https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23962: Assignee: Apache Spark > Flaky tests from SQLMetricsTestUtils.currentExecutionIds > > > Key: SPARK-23962 > URL: https://issues.apache.org/jira/browse/SPARK-23962 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Minor > Attachments: unit-tests.log > > > I've seen > {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin > metrics}} fail > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/ > > with > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: 2 did not equal 1 > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did > not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260) > ... 
> {noformat} > I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is > racing with the listener bus. > I'll attach trimmed logs as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds
[ https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23962: Assignee: (was: Apache Spark) > Flaky tests from SQLMetricsTestUtils.currentExecutionIds > > > Key: SPARK-23962 > URL: https://issues.apache.org/jira/browse/SPARK-23962 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Minor > Attachments: unit-tests.log > > > I've seen > {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin > metrics}} fail > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/ > > with > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: 2 did not equal 1 > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did > not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260) > ... 
> {noformat} > I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is > racing with the listener bus. > I'll attach trimmed logs as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
[ https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-22941: Assignee: Marcelo Vanzin > Allow SparkSubmit to throw exceptions instead of exiting / printing errors. > --- > > Key: SPARK-22941 > URL: https://issues.apache.org/jira/browse/SPARK-22941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > {{SparkSubmit}} now can be called by the {{SparkLauncher}} library (see > SPARK-11035). But if the caller provides incorrect or inconsistent parameters > to the app, {{SparkSubmit}} will print errors to the output and call > {{System.exit}}, which is not very user friendly in this code path. > We should modify {{SparkSubmit}} to be more friendly when called this way, > while still maintaining the old behavior when called from the command line. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
[ https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-22941. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20925 [https://github.com/apache/spark/pull/20925] > Allow SparkSubmit to throw exceptions instead of exiting / printing errors. > --- > > Key: SPARK-22941 > URL: https://issues.apache.org/jira/browse/SPARK-22941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > {{SparkSubmit}} now can be called by the {{SparkLauncher}} library (see > SPARK-11035). But if the caller provides incorrect or inconsistent parameters > to the app, {{SparkSubmit}} will print errors to the output and call > {{System.exit}}, which is not very user friendly in this code path. > We should modify {{SparkSubmit}} to be more friendly when called this way, > while still maintaining the old behavior when called from the command line. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
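The shape of the change can be sketched generically (this is a sketch of the pattern only, not the actual SparkSubmit code from the linked pull request, and the exception and method names here are made up): route errors through a hook that either exits, preserving command-line behavior, or throws, so a caller such as SparkLauncher can handle the failure.

```scala
// Generic "exit vs. throw" error-reporting sketch; names are illustrative.
class SubmitFailedException(msg: String) extends RuntimeException(msg)

def reportError(msg: String, exitOnError: Boolean): Unit =
  if (exitOnError) {
    // Command-line path: keep the old behavior.
    System.err.println(msg)
    sys.exit(1)
  } else {
    // Library path: surface the problem to the caller instead.
    throw new SubmitFailedException(msg)
  }
```

Called with exitOnError = false, invalid or inconsistent arguments surface as a catchable exception instead of killing the caller's JVM.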
[jira] [Created] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds
Imran Rashid created SPARK-23962: Summary: Flaky tests from SQLMetricsTestUtils.currentExecutionIds Key: SPARK-23962 URL: https://issues.apache.org/jira/browse/SPARK-23962 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Imran Rashid Attachments: unit-tests.log I've seen {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin metrics}} fail https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/ with {noformat} Error Message org.scalatest.exceptions.TestFailedException: 2 did not equal 1 Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did not equal 1 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) at org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33) at org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33) at org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188) at org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260) ... {noformat} I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is racing with the listener bus. I'll attach trimmed logs as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
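One generic way test utilities tolerate this kind of listener-bus race (a sketch of the idea only; the actual fix in the linked pull request may differ) is to poll the racy condition until it stabilizes instead of asserting on it immediately:

```scala
// Poll `cond` until it holds or the timeout elapses. Illustrative helper,
// not Spark test code: asynchronous listener events get a window to drain
// before a mismatch is treated as a real failure.
def eventually(timeoutMs: Long = 5000, intervalMs: Long = 50)(cond: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    if (cond) return true
    Thread.sleep(intervalMs)
  }
  cond // final check at the deadline
}
```

ScalaTest ships a similar Eventually trait, which is the idiomatic choice inside a suite; the sketch above just makes the retry loop explicit.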
[jira] [Updated] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds
[ https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23962: - Attachment: unit-tests.log > Flaky tests from SQLMetricsTestUtils.currentExecutionIds > > > Key: SPARK-23962 > URL: https://issues.apache.org/jira/browse/SPARK-23962 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Minor > Attachments: unit-tests.log > > > I've seen > {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin > metrics}} fail > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/ > > with > {noformat} > Error Message > org.scalatest.exceptions.TestFailedException: 2 did not equal 1 > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did > not equal 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33) > at > org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188) > at > org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260) > ... 
> {noformat} > I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is > racing with the listener bus. > I'll attach trimmed logs as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23959) UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam De Backer updated SPARK-23959: -- Description: The following snippet works fine in Spark 2.2.1 but gives a rather cryptic runtime exception in Spark 2.3.0:
{code:java}
import sparkSession.implicits._
import org.apache.spark.sql.functions._

case class X(xid: Long, yid: Int)
case class Y(yid: Int, zid: Long)
case class Z(zid: Long, b: Boolean)

val xs = Seq(X(1L, 10)).toDS()
val ys = Seq(Y(10, 100L)).toDS()
val zs = Seq.empty[Z].toDS()

val j = xs
  .join(ys, "yid")
  .join(zs, Seq("zid"), "left")
  .withColumn("BAM", when('b, "B").otherwise("NB"))

j.show(){code}
In Spark 2.2.1 it prints to the console
{noformat}
+---+---+---+----+---+
|zid|yid|xid|   b|BAM|
+---+---+---+----+---+
|100| 10|  1|null| NB|
+---+---+---+----+---+{noformat}
In Spark 2.3.0 it results in:
{noformat}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'BAM
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
...{noformat}
The culprit really seems to be DataSet being created from an empty Seq[Z].
When you change that to something that will also result in an empty DataSet[Z] it works as in Spark 2.2.1, e.g. {code:java} val zs = Seq(Z(10L, true)).toDS().filter('zid < Long.MinValue){code} was: The following snippet works fine in Spark 2.2.1 but gives a rather cryptic runtime exception in Spark 2.3.0: {code:java} import sparkSession.implicits._ import org.apache.spark.sql.functions._ case class X(xid: Long, yid: Int) case class Y(yid: Int, zid: Long) case class Z(zid: Long, b: Boolean) val xs = Seq(X(1L, 10)).toDS() val ys = Seq(Y(10, 100L)).toDS() val zs = Seq.empty[Z].toDS() val j = xs .join(ys, "yid") .join(zs, Seq("zid"), "left") .withColumn("BAM", when('b, "B").otherwise("NB")) j.show(){code} In Spark 2.2.1 it prints to the console {noformat} +---+---+---++---+ |zid|yid|xid| b|BAM| +---+---+---++---+ |100| 10| 1|null| NB| +---+---+---++---+{noformat} In Spark 2.3.0 it results in: {noformat} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'BAM at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435) at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:296) at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435) at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157) ...{noformat} The culprit really seems to be DataSet being created from an empty Seq[Z]. 
When you change that to something that will also result in an empty DataSet[Z] it works as in Spark 2.2.1, e.g. {code:java} val zs = Seq(Z(10L, true)).toDS().filter('zid === Long.MinValue){code} > UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0 > - > > Key: SPARK-23959 > URL: https://issues.apache.org/jira/browse/SPARK-23959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sam De Backer >Priority: Major > > The following snippet works fine in Spark 2.2.1 but gives a rather cryptic > runtime exception in Spark 2.3.0: > {code:java} > import sparkSession.implicits._ > import org.apache.spark.sql.functions._ > case class X(xid: Long, yid: Int) > case class Y(yid: Int, zid: Long) > case class Z(zid: Long, b: Boolean) > val xs = Seq(X(1L, 10)).toDS() > val ys = Seq(Y(10, 100L)).toDS() > val zs = Seq.empty[Z].toDS() > val
[jira] [Resolved] (SPARK-6951) History server slow startup if the event log directory is large
[ https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-6951. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20952 [https://github.com/apache/spark/pull/20952] > History server slow startup if the event log directory is large > --- > > Key: SPARK-6951 > URL: https://issues.apache.org/jira/browse/SPARK-6951 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Matt Cheah >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > I started my history server, then navigated to the web UI where I expected to > be able to view some completed applications, but the webpage was not > available. It turned out that the History Server was not finished parsing all > of the event logs in the event log directory that I had specified. I had > accumulated a lot of event logs from months of running Spark, so it would > have taken a very long time for the History Server to crunch through them > all. I purged the event log directory and started from scratch, and the UI > loaded immediately. > We should have a pagination strategy or parse the directory lazily to avoid > needing to wait after starting the history server. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6951) History server slow startup if the event log directory is large
[ https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-6951: --- Assignee: Marcelo Vanzin > History server slow startup if the event log directory is large > --- > > Key: SPARK-6951 > URL: https://issues.apache.org/jira/browse/SPARK-6951 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Matt Cheah >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > I started my history server, then navigated to the web UI where I expected to > be able to view some completed applications, but the webpage was not > available. It turned out that the History Server was not finished parsing all > of the event logs in the event log directory that I had specified. I had > accumulated a lot of event logs from months of running Spark, so it would > have taken a very long time for the History Server to crunch through them > all. I purged the event log directory and started from scratch, and the UI > loaded immediately. > We should have a pagination strategy or parse the directory lazily to avoid > needing to wait after starting the history server. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
[ https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434015#comment-16434015 ] Tomasz Gawęda commented on SPARK-12105: --- +1, it's quite a common question on StackOverflow: "how to print show() result in custom PrintStream" > Add a DataFrame.show() with argument for output PrintStream > --- > > Key: SPARK-12105 > URL: https://issues.apache.org/jira/browse/SPARK-12105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Dean Wampler >Priority: Minor > > It would be nice to send the output of DataFrame.show(...) to a different > output stream than stdout, including just capturing the string itself. This > is useful, e.g., for testing. Actually, it would be sufficient and perhaps > better to just make DataFrame.showString a public method, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
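[Editor's sketch] Until a `show()` overload with a target stream exists, the usual workaround is the redirection trick discussed earlier in this thread (`Console.withOut` on the Scala side). The same idea in pure Python, with no Spark on the path: `show_table` below is a hypothetical stand-in for `df.show()`, which also just prints to standard output.

```python
import io
from contextlib import redirect_stdout

def show_table():
    # Hypothetical stand-in for DataFrame.show(): writes an ASCII table to stdout.
    print("+-----+")
    print("|value|")
    print("+-----+")
    print("|    1|")
    print("+-----+")

def capture(fn):
    """Run fn while stdout is redirected into a buffer; return what it printed."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        fn()
    return buf.getvalue()

captured = capture(show_table)
# "captured" now holds the table as a string instead of it going to the console.
```

In PySpark the same pattern wraps the actual `df.show()` call; it captures the string for tests without any API change.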
[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
[ https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12105: Assignee: Apache Spark > Add a DataFrame.show() with argument for output PrintStream > --- > > Key: SPARK-12105 > URL: https://issues.apache.org/jira/browse/SPARK-12105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Dean Wampler >Assignee: Apache Spark >Priority: Minor > > It would be nice to send the output of DataFrame.show(...) to a different > output stream than stdout, including just capturing the string itself. This > is useful, e.g., for testing. Actually, it would be sufficient and perhaps > better to just make DataFrame.showString a public method, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
[ https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12105: Assignee: (was: Apache Spark) > Add a DataFrame.show() with argument for output PrintStream > --- > > Key: SPARK-12105 > URL: https://issues.apache.org/jira/browse/SPARK-12105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Dean Wampler >Priority: Minor > > It would be nice to send the output of DataFrame.show(...) to a different > output stream than stdout, including just capturing the string itself. This > is useful, e.g., for testing. Actually, it would be sufficient and perhaps > better to just make DataFrame.showString a public method, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23960) Mark HashAggregateExec.bufVars as transient
[ https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23960: --- Assignee: Kris Mok > Mark HashAggregateExec.bufVars as transient > --- > > Key: SPARK-23960 > URL: https://issues.apache.org/jira/browse/SPARK-23960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Minor > Fix For: 2.4.0 > > > {{HashAggregateExec.bufVars}} is only used during codegen for global > aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is > on the stack. > Currently, if an {{HashAggregateExec}} is ever captured for serialization, > the {{bufVars}} would be needlessly serialized. > This ticket proposes a minor change to mark the {{bufVars}} field as > transient to avoid serializing it. Also, null out this field at the end of > {{doProduceWithoutKeys()}} to reduce its lifecycle so that the > {{Seq[ExprCode]}} being referenced can be GC'd sooner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23960) Mark HashAggregateExec.bufVars as transient
[ https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23960. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21039 [https://github.com/apache/spark/pull/21039] > Mark HashAggregateExec.bufVars as transient > --- > > Key: SPARK-23960 > URL: https://issues.apache.org/jira/browse/SPARK-23960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Minor > Fix For: 2.4.0 > > > {{HashAggregateExec.bufVars}} is only used during codegen for global > aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is > on the stack. > Currently, if an {{HashAggregateExec}} is ever captured for serialization, > the {{bufVars}} would be needlessly serialized. > This ticket proposes a minor change to mark the {{bufVars}} field as > transient to avoid serializing it. Also, null out this field at the end of > {{doProduceWithoutKeys()}} to reduce its lifecycle so that the > {{Seq[ExprCode]}} being referenced can be GC'd sooner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23927) High-order function: sequence
[ https://issues.apache.org/jira/browse/SPARK-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433926#comment-16433926 ] Alex Wajda commented on SPARK-23927: I will take this one. Thanks. > High-order function: sequence > - > > Key: SPARK-23927 > URL: https://issues.apache.org/jira/browse/SPARK-23927 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > * sequence(start, stop) → array > Generate a sequence of integers from start to stop, incrementing by 1 if > start is less than or equal to stop, otherwise -1. > * sequence(start, stop, step) → array > Generate a sequence of integers from start to stop, incrementing by step. > * sequence(start, stop) → array > Generate a sequence of dates from start date to stop date, incrementing by 1 > day if start date is less than or equal to stop date, otherwise -1 day. > * sequence(start, stop, step) → array > Generate a sequence of dates from start to stop, incrementing by step. The > type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH. > * sequence(start, stop, step) → array > Generate a sequence of timestamps from start to stop, incrementing by step. > The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO > MONTH. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
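[Editor's sketch] For the integer case, the Presto contract quoted above can be pinned down with a small reference implementation. This is only a sketch of the documented semantics (default step +1 when start <= stop, otherwise -1), not Spark's eventual code; the date/timestamp variants follow the same shape with interval steps.

```python
def sequence(start, stop, step=None):
    """Reference semantics for Presto's sequence() on integers."""
    if step is None:
        # Default direction depends on the ordering of start and stop.
        step = 1 if start <= stop else -1
    if step == 0:
        raise ValueError("step must not be zero")
    out, v = [], start
    if step > 0:
        while v <= stop:
            out.append(v)
            v += step
    else:
        while v >= stop:
            out.append(v)
            v += step
    return out

# sequence(1, 4)      -> [1, 2, 3, 4]
# sequence(4, 1)      -> [4, 3, 2, 1]
# sequence(1, 10, 3)  -> [1, 4, 7, 10]
```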
[jira] [Updated] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-23961: - Issue Type: Bug (was: Improvement) > pyspark toLocalIterator throws an exception > --- > > Key: SPARK-23961 > URL: https://issues.apache.org/jira/browse/SPARK-23961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Michel Lemay >Priority: Minor > Labels: DataFrame, pyspark > > Given a dataframe, take it's rdd and use toLocalIterator. If we do not > consume all records, it will throw: > {quote}ERROR PythonRDD: Error while sending iterator > java.net.SocketException: Connection reset by peer: socket write error > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at > org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) > {quote} > > To reproduce, here is a simple pyspark shell script that shows the error: > {quote}import itertools > df = spark.read.parquet("large parquet folder") > cachedRDD = df.rdd.cache() > print(cachedRDD.count()) # materialize > b = cachedRDD.toLocalIterator() > print(len(list(itertools.islice(b, 20)))) > b = None # Make the iterator go out of scope. Throws here. > {quote} > > Observations: > * Consuming all records does not throw. Taking only a subset of the > partitions creates the error. > * In another experiment, doing the same on a regular RDD works if we > cache/materialize it. If we do not cache the RDD, it throws similarly. > * It works in the Scala shell > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23961) pyspark toLocalIterator throws an exception
Michel Lemay created SPARK-23961: Summary: pyspark toLocalIterator throws an exception Key: SPARK-23961 URL: https://issues.apache.org/jira/browse/SPARK-23961 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.2.1 Reporter: Michel Lemay Given a dataframe, take it's rdd and use toLocalIterator. If we do not consume all records, it will throw: {quote}ERROR PythonRDD: Error while sending iterator java.net.SocketException: Connection reset by peer: socket write error at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) at java.net.SocketOutputStream.write(SocketOutputStream.java:155) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97) at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) {quote} To reproduce, here is a simple pyspark shell script that show the error: {quote}import itertools df = spark.read.parquet("large parquet folder") cachedRDD 
= df.rdd.cache() print(cachedRDD.count()) # materialize b = cachedRDD.toLocalIterator() print(len(list(itertools.islice(b, 20)))) b = None # Make the iterator go out of scope. Throws here. {quote} Observations: * Consuming all records does not throw. Taking only a subset of the partitions creates the error. * In another experiment, doing the same on a regular RDD works if we cache/materialize it. If we do not cache the RDD, it throws similarly. * It works in the Scala shell -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
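[Editor's sketch] The failure mode in this report is a partially consumed iterator being discarded while the producer is still writing into the socket. The mechanism is easy to see in pure Python, without Spark: when a generator is dropped before exhaustion, Python raises GeneratorExit at its suspended yield, and any producer that keeps pushing data past that point errors out. `record_stream` below is a toy stand-in for `toLocalIterator()`, used only to make the early-disposal event observable.

```python
import itertools

def record_stream(n):
    """Toy stand-in for toLocalIterator(): yields records and records
    whether the consumer abandoned the iterator before exhausting it."""
    status = {"closed_early": False}
    def gen():
        try:
            for i in range(n):
                yield i
        except GeneratorExit:
            # Raised when the consumer discards the iterator early --
            # the analogue of the server-side "Connection reset" above.
            status["closed_early"] = True
            raise
    return gen(), status

it, status = record_stream(100)
head = list(itertools.islice(it, 20))  # consume only a prefix, as in the report
it.close()  # explicit close; dropping the last reference triggers the same path
```

Consuming the whole stream instead (or closing it deliberately, as here) lets the producer clean up instead of failing mid-write.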
[jira] [Assigned] (SPARK-23930) High-order function: slice(x, start, length) → array
[ https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23930: Assignee: Apache Spark > High-order function: slice(x, start, length) → array > > > Key: SPARK-23930 > URL: https://issues.apache.org/jira/browse/SPARK-23930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Subsets array x starting from index start (or starting from the end if start > is negative) with a length of length. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23930) High-order function: slice(x, start, length) → array
[ https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433837#comment-16433837 ] Apache Spark commented on SPARK-23930: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21040 > High-order function: slice(x, start, length) → array > > > Key: SPARK-23930 > URL: https://issues.apache.org/jira/browse/SPARK-23930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Subsets array x starting from index start (or starting from the end if start > is negative) with a length of length. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23930) High-order function: slice(x, start, length) → array
[ https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23930: Assignee: (was: Apache Spark) > High-order function: slice(x, start, length) → array > > > Key: SPARK-23930 > URL: https://issues.apache.org/jira/browse/SPARK-23930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Subsets array x starting from index start (or starting from the end if start > is negative) with a length of length. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
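[Editor's sketch] The Presto description of slice is terse, so a small reference implementation helps pin down the edge cases: SQL array indexes are 1-based, and a negative start counts from the end. This sketches the documented contract only, not the Spark PR's actual code.

```python
def slice_array(x, start, length):
    """Reference semantics for Presto's slice(x, start, length)."""
    if start == 0:
        raise ValueError("SQL array indexes start at 1, not 0")
    if length < 0:
        raise ValueError("length must be non-negative")
    if start > 0:
        begin = start - 1          # convert 1-based index to 0-based
    else:
        begin = len(x) + start     # negative start counts from the end
        if begin < 0:
            return []
    return x[begin:begin + length]

# slice_array([1, 2, 3, 4, 5],  2, 2) -> [2, 3]
# slice_array([1, 2, 3, 4, 5], -3, 2) -> [3, 4]
```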
[jira] [Resolved] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff
[ https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23951. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21026 [https://github.com/apache/spark/pull/21026] > Use java classed in ExprValue and simplify a bunch of stuff > --- > > Key: SPARK-23951 > URL: https://issues.apache.org/jira/browse/SPARK-23951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23960) Mark HashAggregateExec.bufVars as transient
[ https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23960: Assignee: Apache Spark > Mark HashAggregateExec.bufVars as transient > --- > > Key: SPARK-23960 > URL: https://issues.apache.org/jira/browse/SPARK-23960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Minor > > {{HashAggregateExec.bufVars}} is only used during codegen for global > aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is > on the stack. > Currently, if an {{HashAggregateExec}} is ever captured for serialization, > the {{bufVars}} would be needlessly serialized. > This ticket proposes a minor change to mark the {{bufVars}} field as > transient to avoid serializing it. Also, null out this field at the end of > {{doProduceWithoutKeys()}} to reduce its lifecycle so that the > {{Seq[ExprCode]}} being referenced can be GC'd sooner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23960) Mark HashAggregateExec.bufVars as transient
[ https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433672#comment-16433672 ] Apache Spark commented on SPARK-23960: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/21039 > Mark HashAggregateExec.bufVars as transient > --- > > Key: SPARK-23960 > URL: https://issues.apache.org/jira/browse/SPARK-23960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kris Mok >Priority: Minor > > {{HashAggregateExec.bufVars}} is only used during codegen for global > aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is > on the stack. > Currently, if an {{HashAggregateExec}} is ever captured for serialization, > the {{bufVars}} would be needlessly serialized. > This ticket proposes a minor change to mark the {{bufVars}} field as > transient to avoid serializing it. Also, null out this field at the end of > {{doProduceWithoutKeys()}} to reduce its lifecycle so that the > {{Seq[ExprCode]}} being referenced can be GC'd sooner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23960) Mark HashAggregateExec.bufVars as transient
[ https://issues.apache.org/jira/browse/SPARK-23960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23960: Assignee: (was: Apache Spark) > Mark HashAggregateExec.bufVars as transient > --- > > Key: SPARK-23960 > URL: https://issues.apache.org/jira/browse/SPARK-23960 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kris Mok >Priority: Minor > > {{HashAggregateExec.bufVars}} is only used during codegen for global > aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is > on the stack. > Currently, if an {{HashAggregateExec}} is ever captured for serialization, > the {{bufVars}} would be needlessly serialized. > This ticket proposes a minor change to mark the {{bufVars}} field as > transient to avoid serializing it. Also, null out this field at the end of > {{doProduceWithoutKeys()}} to reduce its lifecycle so that the > {{Seq[ExprCode]}} being referenced can be GC'd sooner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23960) Mark HashAggregateExec.bufVars as transient
Kris Mok created SPARK-23960: Summary: Mark HashAggregateExec.bufVars as transient Key: SPARK-23960 URL: https://issues.apache.org/jira/browse/SPARK-23960 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kris Mok {{HashAggregateExec.bufVars}} is only used during codegen for global aggregation. Specifically it's only used while {{doProduceWithoutKeys()}} is on the stack. Currently, if an {{HashAggregateExec}} is ever captured for serialization, the {{bufVars}} would be needlessly serialized. This ticket proposes a minor change to mark the {{bufVars}} field as transient to avoid serializing it. Also, null out this field at the end of {{doProduceWithoutKeys()}} to reduce its lifecycle so that the {{Seq[ExprCode]}} being referenced can be GC'd sooner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
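[Editor's sketch] The ticket's fix is JVM-specific (Scala's `@transient` keeps a field out of Java serialization), but the underlying idea, i.e. ephemeral scratch state should not travel with a serialized object, has a direct pickle analogue in Python: drop the field in `__getstate__`. `AggState` and `buf_vars` below are hypothetical names for illustration only.

```python
import pickle

class AggState:
    """Illustrates the 'transient field' idea in pickle terms: buf_vars is
    scratch state that must not be serialized along with the instance."""
    def __init__(self, name):
        self.name = name
        self.buf_vars = ["scratch"] * 1000  # ephemeral, potentially large

    def __getstate__(self):
        state = self.__dict__.copy()
        state["buf_vars"] = None  # analogous to marking the field transient
        return state

restored = pickle.loads(pickle.dumps(AggState("agg")))
# restored.name survives; restored.buf_vars was deliberately not serialized.
```

Nulling the field out after its last use, as the ticket also proposes for `bufVars`, additionally lets the referenced objects be garbage-collected sooner.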
[jira] [Created] (SPARK-23959) UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
Sam De Backer created SPARK-23959:
-------------------------------------
Summary: UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
Key: SPARK-23959
URL: https://issues.apache.org/jira/browse/SPARK-23959
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Sam De Backer

The following snippet works fine in Spark 2.2.1 but gives a rather cryptic runtime exception in Spark 2.3.0:
{code:java}
import sparkSession.implicits._
import org.apache.spark.sql.functions._

case class X(xid: Long, yid: Int)
case class Y(yid: Int, zid: Long)
case class Z(zid: Long, b: Boolean)

val xs = Seq(X(1L, 10)).toDS()
val ys = Seq(Y(10, 100L)).toDS()
val zs = Seq.empty[Z].toDS()
val j = xs
  .join(ys, "yid")
  .join(zs, Seq("zid"), "left")
  .withColumn("BAM", when('b, "B").otherwise("NB"))
j.show(){code}
In Spark 2.2.1 it prints to the console
{noformat}
+---+---+---+----+---+
|zid|yid|xid|   b|BAM|
+---+---+---+----+---+
|100| 10|  1|null| NB|
+---+---+---+----+---+{noformat}
In Spark 2.3.0 it results in:
{noformat}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'BAM
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
...{noformat}
The culprit really seems to be DataSet being created from an empty Seq[Z]. When you change that to something that will also result in an empty DataSet[Z] it works as in Spark 2.2.1, e.g.
{code:java}
val zs = Seq(Z(10L, true)).toDS().filter('zid === Long.MinValue){code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23930) High-order function: slice(x, start, length) → array
[ https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433550#comment-16433550 ] Marco Gaido commented on SPARK-23930: - I am working on this. > High-order function: slice(x, start, length) → array > > > Key: SPARK-23930 > URL: https://issues.apache.org/jira/browse/SPARK-23930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Subsets array x starting from index start (or starting from the end if start > is negative) with a length of length. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433536#comment-16433536 ] Jepson commented on SPARK-22968:
--
[~apachespark] Thank you very much.

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1
> Environment: Kafka: 0.10.0 (CDH5.12.0)
> Apache Spark 2.1.1
> Spark Streaming + Kafka
> Reporter: Jepson
> Priority: Major
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
> message.max.bytes : 2621440
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) // default: 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) // default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) // default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
> 	at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
> 	at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
> 	at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
> 	at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
> 	at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> 	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> 	at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> 	at scala.Option.orElse(Option.scala:289)
> 	at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> 	at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> 	at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
> 	at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> 	at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
> 	at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
> 	at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
> 	at scala.util.Try$.apply(Try.scala:192)
> {code}
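The trace above shows the failure mode: a rebalance reassigns the group's partitions (generation 4 keeps only kssh-7, kssh-4, kssh-6, kssh-5), and the next batch's `DirectKafkaInputDStream.latestOffsets` still calls `seekToEnd` on the revoked partition kssh-2, which the Kafka client rejects. The following is a toy model of that race only; `ToyConsumer` and its methods are illustrative names and not the real Kafka consumer API.

```python
class ToyConsumer:
    """Toy model of the rebalance race in SPARK-22968 (not the real Kafka client)."""

    def __init__(self, assigned):
        self.assigned = set(assigned)

    def rebalance(self, new_assignment):
        # The group coordinator revokes the old assignment and installs the new one.
        self.assigned = set(new_assignment)

    def seek_to_end(self, partition):
        # Mirrors the check in SubscriptionState.assignedState: seeking on a
        # partition the consumer no longer owns raises IllegalStateException
        # in the JVM client; modeled here as a RuntimeError.
        if partition not in self.assigned:
            raise RuntimeError(f"No current assignment for partition {partition}")


# Before the rebalance the consumer owns kssh-0 .. kssh-7.
consumer = ToyConsumer([f"kssh-{i}" for i in range(8)])
# Generation 4 assigns only four partitions to this member.
consumer.rebalance(["kssh-7", "kssh-4", "kssh-6", "kssh-5"])
try:
    # What latestOffsets effectively does for the stale partition list:
    consumer.seek_to_end("kssh-2")
except RuntimeError as e:
    print(e)  # No current assignment for partition kssh-2
```

The fix direction discussed in the linked PR is to refresh the cached assignment after a rebalance instead of seeking on the stale partition set.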
[jira] [Assigned] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22968: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22968: Assignee: Apache Spark
[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433510#comment-16433510 ] Apache Spark commented on SPARK-22968:
--
User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/21038