[jira] [Assigned] (SPARK-24966) Fix the precedence rule for set operations.
[ https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24966: Assignee: Apache Spark > Fix the precedence rule for set operations. > --- > > Key: SPARK-24966 > URL: https://issues.apache.org/jira/browse/SPARK-24966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Major > > Currently the set operations INTERSECT, UNION and EXCEPT are assigned the > same precedence. We need to change to make sure INTERSECT is given higher > precedence than UNION and EXCEPT. UNION and EXCEPT should be evaluated in the > order they appear in the query from left to right. > Given this will result in a change in behavior, we need to keep it under a > config. > Here is a reference : > https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
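Why the precedence change matters can be illustrated outside Spark: the two parse trees for `a UNION b INTERSECT c` give different answers. A minimal sketch using Python sets (illustrative only; it ignores SQL duplicate semantics and is not Spark code):

```python
# Hypothetical illustration of set-operator precedence for "a UNION b INTERSECT c".
a = {1, 2}
b = {2, 3}
c = {3, 4}

# All operators at the same precedence, evaluated left to right:
# (a UNION b) INTERSECT c
left_to_right = (a | b) & c        # {3}

# INTERSECT binds tighter (the SQL-standard behavior this issue adopts):
# a UNION (b INTERSECT c)
intersect_first = a | (b & c)      # {1, 2, 3}

assert left_to_right != intersect_first
```

Because existing queries relying on the old left-to-right grouping would silently change results, the issue proposes guarding the new behavior behind a config.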
[jira] [Commented] (SPARK-24966) Fix the precedence rule for set operations.
[ https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564772#comment-16564772 ] Apache Spark commented on SPARK-24966: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21941 > Fix the precedence rule for set operations. > --- > > Key: SPARK-24966 > URL: https://issues.apache.org/jira/browse/SPARK-24966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Major > > Currently the set operations INTERSECT, UNION and EXCEPT are assigned the > same precedence. We need to change to make sure INTERSECT is given higher > precedence than UNION and EXCEPT. UNION and EXCEPT should be evaluated in the > order they appear in the query from left to right. > Given this will result in a change in behavior, we need to keep it under a > config. > Here is a reference : > https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24966) Fix the precedence rule for set operations.
[ https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24966: Assignee: (was: Apache Spark) > Fix the precedence rule for set operations. > --- > > Key: SPARK-24966 > URL: https://issues.apache.org/jira/browse/SPARK-24966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Major > > Currently the set operations INTERSECT, UNION and EXCEPT are assigned the > same precedence. We need to change to make sure INTERSECT is given higher > precedence than UNION and EXCEPT. UNION and EXCEPT should be evaluated in the > order they appear in the query from left to right. > Given this will result in a change in behavior, we need to keep it under a > config. > Here is a reference : > https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24951. - Resolution: Fixed Fix Version/s: 2.4.0 > Table valued functions should throw AnalysisException instead of > IllegalArgumentException > - > > Key: SPARK-24951 > URL: https://issues.apache.org/jira/browse/SPARK-24951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > > When arguments don't match, TVFs currently throw IllegalArgumentException, > inconsistent with other functions, which throw AnalysisException. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file
[ https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-24311. -- Resolution: Won't Fix Based on the feedback at [https://github.com/apache/spark/pull/21357#issuecomment-409427970], the patch is unlikely to be accepted. Closing this for now. > Refactor HDFSBackedStateStoreProvider to remove duplicated logic between > operations on delta file and snapshot file > --- > > Key: SPARK-24311 > URL: https://issues.apache.org/jira/browse/SPARK-24311 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jungtaek Lim >Priority: Major > > The structure of the delta file and the snapshot file is the same, but the operations are > defined as individual methods, which duplicates logic between the delta > file and the snapshot file. > We can refactor to remove the duplicated logic, improving readability and > guaranteeing that the structure of the delta file and the snapshot file stays the same.
[jira] [Resolved] (SPARK-24893) Remove the entire CaseWhen if all the outputs are semantic equivalence
[ https://issues.apache.org/jira/browse/SPARK-24893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24893. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21852 [https://github.com/apache/spark/pull/21852] > Remove the entire CaseWhen if all the outputs are semantic equivalence > -- > > Key: SPARK-24893 > URL: https://issues.apache.org/jira/browse/SPARK-24893 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Similar to [SPARK-24890], if all the outputs of `CaseWhen` are semantic > equivalence, `CaseWhen` can be removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
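The shape of the optimization can be sketched as follows. This is a hypothetical toy model, not Spark's Catalyst rule: equality here stands in for `semanticEquals`, and the real rule must also ensure the branch conditions are deterministic before dropping them.

```python
# Hypothetical sketch: if every THEN output and the ELSE output of a CASE WHEN
# are semantically equal, the whole expression reduces to that single output.

def simplify_case(branches, else_value):
    """branches: list of (condition, value) pairs; returns a simplified expression."""
    outputs = [value for _, value in branches] + [else_value]
    if all(out == outputs[0] for out in outputs):
        # All outputs agree, so the conditions are irrelevant.
        return outputs[0]
    return ("CASE", branches, else_value)

# CASE WHEN x > 0 THEN 'same' WHEN x < 0 THEN 'same' ELSE 'same' END  ->  'same'
assert simplify_case([("x > 0", "same"), ("x < 0", "same")], "same") == "same"
```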
[jira] [Created] (SPARK-24986) OOM in BufferHolder during writes to a stream
Sanket Reddy created SPARK-24986: Summary: OOM in BufferHolder during writes to a stream Key: SPARK-24986 URL: https://issues.apache.org/jira/browse/SPARK-24986 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0, 2.2.0, 2.1.0 Reporter: Sanket Reddy We have seen an out-of-memory exception while running one of our production jobs. We expected memory allocation to be managed by the unified memory manager at run time. The buffer grows during writes as follows: if the row length is constant, the buffer does not grow; it keeps resetting and writing values into the buffer. If the rows are variable-length and skewed, with very large values to be written, the buffer keeps growing, and the estimator that requests the initial execution memory does not appear to account for this. Checking the underlying heap before growing the global buffer might be a viable option. java.lang.OutOfMemoryError: Java heap space at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:232) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:221) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:159) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1075) at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091) at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1129) at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:513) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:329) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1966) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:270) 18/06/11 21:18:41 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[stdout writer for Python/bin/python3.6,5,main] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
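The mitigation suggested in the report can be sketched generically. This is a hypothetical illustration, not Spark's `BufferHolder`: before doubling a write buffer, check how much room is actually available instead of growing blindly.

```python
# Hypothetical sketch: exponential buffer growth guarded by an availability check.

def grow_buffer(buf: bytearray, needed: int, heap_available: int) -> bytearray:
    """Grow buf so it can hold `needed` more bytes, guarding against OOM."""
    required = len(buf) + needed
    new_size = max(required, len(buf) * 2)  # the usual doubling strategy
    if new_size > heap_available:
        if required > heap_available:
            # Fail fast with a clear error instead of blowing up the heap.
            raise MemoryError(f"need {required} bytes, only {heap_available} available")
        new_size = required  # grow only as much as strictly needed
    new_buf = bytearray(new_size)
    new_buf[:len(buf)] = buf
    return new_buf
```

With skewed, variable-length rows, an unchecked doubling strategy is exactly where the reported `BufferHolder.grow` OOM surfaces.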
[jira] [Created] (SPARK-24985) Executing SQL with "Full Outer Join" on top of large tables when there is data skew met OOM
sheperd huang created SPARK-24985: - Summary: Executing SQL with "Full Outer Join" on top of large tables when there is data skew met OOM Key: SPARK-24985 URL: https://issues.apache.org/jira/browse/SPARK-24985 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: sheperd huang When we run SQL with "Full Outer Join" on large tables with data skew, we found it is quite easy to hit OOM. We initially thought we had hit https://issues.apache.org/jira/browse/SPARK-13450, but taking a look at the fix in [https://github.com/apache/spark/pull/16909], we found that PR does not handle the "Full Outer Join" case. The root cause of the OOM is that there are many rows with the same key. See the code below:
{code:java}
private def findMatchingRows(matchingKey: InternalRow): Unit = {
  leftMatches.clear()
  rightMatches.clear()
  leftIndex = 0
  rightIndex = 0
  while (leftRowKey != null && keyOrdering.compare(leftRowKey, matchingKey) == 0) {
    leftMatches += leftRow.copy()
    advancedLeft()
  }
  while (rightRowKey != null && keyOrdering.compare(rightRowKey, matchingKey) == 0) {
    rightMatches += rightRow.copy()
    advancedRight()
  }
{code}
It seems nothing limits the amount of data added to leftMatches and rightMatches.
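The memory blowup described above can be sketched in miniature. This is a hypothetical toy model of the buffering behavior, not Spark's sort-merge join:

```python
# Hypothetical sketch: for one join key, a sort-merge full outer join buffers
# every matching row from both sides before emitting the cross product.
# With a heavily skewed key, both buffers grow without bound.

def match_rows_for_key(left_rows, right_rows, key):
    left_matches = [r for r in left_rows if r[0] == key]    # unbounded buffer
    right_matches = [r for r in right_rows if r[0] == key]  # unbounded buffer
    # Full outer join emits every pairing of the buffered rows.
    return [(l, r) for l in left_matches for r in right_matches]

# 300 skewed rows on each side already yield 90,000 buffered pairs for one key;
# the growth is quadratic in the skew.
pairs = match_rows_for_key([("k", i) for i in range(300)],
                           [("k", i) for i in range(300)], "k")
assert len(pairs) == 90_000
```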
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564570#comment-16564570 ] Apache Spark commented on SPARK-23874: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/21939 > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported in as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564562#comment-16564562 ] Saisai Shao commented on SPARK-24615: - Leveraging dynamic allocation to tear down and bring up desired executor is a non-goal here, we will address it in feature, currently we're still focusing on using static allocation like spark.executor.gpus. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. 
> > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564562#comment-16564562 ] Saisai Shao edited comment on SPARK-24615 at 8/1/18 12:35 AM: -- Leveraging dynamic allocation to tear down and bring up desired executor is a non-goal here, we will address it in future, currently we're still focusing on using static allocation like spark.executor.gpus. was (Author: jerryshao): Leveraging dynamic allocation to tear down and bring up desired executor is a non-goal here, we will address it in feature, currently we're still focusing on using static allocation like spark.executor.gpus. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). 
This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564558#comment-16564558 ] Erik Erlandson commented on SPARK-24615: Am I understanding correctly that this can't assign executors to desired resources without resorting to Dynamic Allocation to tear down an Executor and reallocate it somewhere else? > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. 
> > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
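The scheduling mismatch the SPIP describes can be sketched abstractly. This is purely illustrative; the actual design lives in the linked Google doc, and the names below are invented:

```python
# Hypothetical sketch of resource-aware placement: a task that declares a GPU
# requirement may only be placed on an executor with free GPUs, regardless of
# how many CPU slots that executor still has.

def can_schedule(task: dict, executor: dict) -> bool:
    return (executor["free_cpus"] >= task.get("cpus", 1) and
            executor["free_gpus"] >= task.get("gpus", 0))

gpu_task = {"cpus": 1, "gpus": 1}
cpu_only_executor = {"free_cpus": 8, "free_gpus": 0}
gpu_executor = {"free_cpus": 8, "free_gpus": 2}

assert not can_schedule(gpu_task, cpu_only_executor)  # CPU slots alone don't suffice
assert can_schedule(gpu_task, gpu_executor)
```

This captures point 1 of the proposal: scheduling GPU tasks by CPU-core availability alone would wrongly admit `cpu_only_executor`.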
[jira] [Resolved] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
[ https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24976. -- Resolution: Fixed Fix Version/s: 2.3.2 2.4.0 Issue resolved by pull request 21928 [https://github.com/apache/spark/pull/21928] > Allow None for Decimal type conversion (specific to PyArrow 0.9.0) > -- > > Key: SPARK-24976 > URL: https://issues.apache.org/jira/browse/SPARK-24976 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0, 2.3.2 > > > See https://jira.apache.org/jira/browse/ARROW-2432 > If we use Arrow 0.9.0, the test case (None as decimal) fails as below: > {code} > Traceback (most recent call last): > File "/.../spark/python/pyspark/sql/tests.py", line 4672, in > test_vectorized_udf_null_decimal > self.assertEquals(df.collect(), res.collect()) > File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect > sock_info = self._jdf.collectToPython() > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o51.collectToPython. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 > in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/.../spark/python/pyspark/worker.py", line 320, in main > process() > File "/.../spark/python/pyspark/worker.py", line 315, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream > batch = _create_batch(series, self._timezone) > File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch > arrs = [create_array(s, t) for s, t in series] > File "/.../spark/python/pyspark/serializers.py", line 241, in create_array > return pa.Array.from_pandas(s, mask=mask, type=t) > File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
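The general shape of the workaround can be sketched without pyarrow. This is a hypothetical illustration of the idea, not Spark's actual patch: when a converter only accepts `decimal.Decimal`, mask out `None` entries explicitly instead of handing `NoneType` to the converter.

```python
# Hypothetical sketch: separate nullability (a mask) from the value conversion,
# so a Decimal-only converter never sees None.

from decimal import Decimal

def to_decimal_array(values):
    """Convert a list of Decimal-or-None into (converted, null_mask)."""
    null_mask = [v is None for v in values]
    # Placeholder values under the mask; a consumer must honor null_mask.
    converted = [v if v is not None else Decimal(0) for v in values]
    if not all(isinstance(v, Decimal) for v in converted):
        raise TypeError("can only handle decimal.Decimal or None")
    return converted, null_mask

vals, mask = to_decimal_array([Decimal("1.5"), None, Decimal("2.0")])
assert mask == [False, True, False]
```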
[jira] [Assigned] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
[ https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-24976: Assignee: Hyukjin Kwon > Allow None for Decimal type conversion (specific to PyArrow 0.9.0) > -- > > Key: SPARK-24976 > URL: https://issues.apache.org/jira/browse/SPARK-24976 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > See https://jira.apache.org/jira/browse/ARROW-2432 > If we use Arrow 0.9.0, the the test case (None as decimal) failed as below: > {code} > Traceback (most recent call last): > File "/.../spark/python/pyspark/sql/tests.py", line 4672, in > test_vectorized_udf_null_decimal > self.assertEquals(df.collect(), res.collect()) > File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect > sock_info = self._jdf.collectToPython() > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o51.collectToPython. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 > in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/.../spark/python/pyspark/worker.py", line 320, in main > process() > File "/.../spark/python/pyspark/worker.py", line 315, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream > batch = _create_batch(series, self._timezone) > File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch > arrs = [create_array(s, t) for s, t in series] > File "/.../spark/python/pyspark/serializers.py", line 241, in create_array > return pa.Array.from_pandas(s, mask=mask, type=t) > File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19394) "assertion failed: Expected hostname" on macOS when self-assigned IP contains a percent sign
[ https://issues.apache.org/jira/browse/SPARK-19394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564532#comment-16564532 ] Yuming Wang commented on SPARK-19394: - Try to add {{::1 localhost}} to /etc/hosts. > "assertion failed: Expected hostname" on macOS when self-assigned IP contains > a percent sign > > > Key: SPARK-19394 > URL: https://issues.apache.org/jira/browse/SPARK-19394 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Jacek Laskowski >Priority: Minor > > See [this question on > StackOverflow|http://stackoverflow.com/q/41914586/1305344]. > {quote} > So when I am not connected to internet, spark shell fails to load in local > mode. I am running Apache Spark 2.1.0 downloaded from internet, running on my > Mac. So I run ./bin/spark-shell and it gives me the error below. > So I have read the Spark code and it is using Java's > InetAddress.getLocalHost() to find the localhost's IP address. So when I am > connected to internet, I get back an IPv4 with my local hostname. > scala> InetAddress.getLocalHost > res9: java.net.InetAddress = AliKheyrollahis-MacBook-Pro.local/192.168.1.26 > but the key is, when disconnected, I get an IPv6 with a percentage in the > values (it is scoped): > scala> InetAddress.getLocalHost > res10: java.net.InetAddress = > AliKheyrollahis-MacBook-Pro.local/fe80:0:0:0:2b9a:4521:a301:e9a5%10 > And this IP is the same as the one you see in the error message. I feel my > problem is that it throws Spark since it cannot handle %10 in the result. > ... > 17/01/28 22:03:28 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://fe80:0:0:0:2b9a:4521:a301:e9a5%10:4040 > 17/01/28 22:03:28 INFO Executor: Starting executor ID driver on host localhost > 17/01/28 22:03:28 INFO Executor: Using REPL class URI: > spark://fe80:0:0:0:2b9a:4521:a301:e9a5%10:56107/classes > 17/01/28 22:03:28 ERROR SparkContext: Error initializing SparkContext. 
> java.lang.AssertionError: assertion failed: Expected hostname > at scala.Predef$.assert(Predef.scala:170) > at org.apache.spark.util.Utils$.checkHost(Utils.scala:931) > at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:31) > at org.apache.spark.executor.Executor.(Executor.scala:121) > at > org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156) > at org.apache.spark.SparkContext.(SparkContext.scala:509) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
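The failure mode is that a link-local IPv6 address carries a zone/scope id after `%` (the `%10` above), which a strict hostname check rejects. A hypothetical illustration of one possible handling, stripping the zone id before validation (not Spark's actual fix):

```python
# Hypothetical sketch: scoped IPv6 addresses like "fe80::1%10" carry a zone id
# after '%'; strip it before applying a hostname/address validity check.

def strip_zone_id(host: str) -> str:
    """Drop the '%<scope>' suffix from a scoped IPv6 address, if present."""
    return host.split("%", 1)[0]

addr = "fe80:0:0:0:2b9a:4521:a301:e9a5%10"
assert "%" in addr  # this is what trips Utils.checkHost's assertion
assert strip_zone_id(addr) == "fe80:0:0:0:2b9a:4521:a301:e9a5"
```

The `::1 localhost` /etc/hosts suggestion in the comment above sidesteps the issue differently, by making localhost resolution succeed without the scoped link-local address.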
[jira] [Commented] (SPARK-24977) input_file_name() result can't save and use for partitionBy()
[ https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564490#comment-16564490 ] kevin yu commented on SPARK-24977: -- Hello Srinivasarao: Can you show the steps by which you encountered the problem? I just did a quick test and it seems to work fine, but I am not sure it is the same as yours.
scala> spark.read.textFile("file:///etc/passwd")
res3: org.apache.spark.sql.Dataset[String] = [value: string]
scala> res3.select(input_file_name() as "input", expr("10 as col2")).write.partitionBy("input").saveAsTable("passwd3")
18/07/31 16:11:59 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
scala> spark.sql("select * from passwd3").show
+----+------------------+
|col2|             input|
+----+------------------+
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
+----+------------------+
only showing top 20 rows
> input_file_name() result can't save and use for partitionBy() > - > > Key: SPARK-24977 > URL: https://issues.apache.org/jira/browse/SPARK-24977 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: Srinivasarao Padala >Priority: Major >
[jira] [Created] (SPARK-24984) Spark Streaming with xml data
Kavya created SPARK-24984: - Summary: Spark Streaming with xml data Key: SPARK-24984 URL: https://issues.apache.org/jira/browse/SPARK-24984 Project: Spark Issue Type: New Feature Components: DStreams Affects Versions: 2.3.1 Reporter: Kavya Fix For: 0.8.2 Hi, We have a Kinesis producer which reads xml data, but I'm currently unable to parse the xml using Spark Streaming. I'm able to read the data, but unable to parse it into a schema. I'm able to read the data via spark.option("format","com..xml").load(), which provides the schema, but I don't want to read via files.
[jira] [Updated] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
[ https://issues.apache.org/jira/browse/SPARK-24983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Vogelbacher updated SPARK-24983: -- Description: I noticed that writing a spark job that includes many sequential {{when-otherwise}} statements on the same column can easily OOM the driver while generating the optimized plan because the project node will grow exponentially in size. Example: {noformat} scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> val df = Seq("a", "b", "c", "1").toDF("text") df: org.apache.spark.sql.DataFrame = [text: string] scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string] scala> for( a <- 1 to 5) { | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text")) | } scala> dfCaseWhen.queryExecution.analyzed res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] +- Filter NOT (text#3 = 0) +- Project [value#1 AS text#3] +- LocalRelation [value#1] scala> dfCaseWhen.queryExecution.optimizedPlan res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 
ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... {noformat} As one can see the optimized plan grows exponentially in the number of {{when-otherwise}} statements here. I can see that this comes from the {{CollapseProject}} optimizer rule. Maybe we should put a limit on the resulting size of the project node after collapsing and only collapse if we stay under the limit. was: Hi, I noticed that writing a spark job that includes many sequential when-otherwise statements on the same column can easily OOM the driver while generating the optimized plan because the project node will grow exponentially in size. Example: {noformat} scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> val df = Seq("a", "b", "c", "1").toDF("text") df: org.apache.spark.sql.DataFrame = [text: string] scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string] scala> for( a <- 1 to 5) { | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text")) | } scala> dfCaseWhen.queryExecution.analyzed res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] +- Filter NOT (text#3 = 0) +- Project [value#1 AS text#3] +- LocalRelation [value#1] scala> dfCaseWhen.queryExecution.optimizedPlan res5: 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1)
[jira] [Created] (SPARK-24983) Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver
David Vogelbacher created SPARK-24983: - Summary: Collapsing multiple project statements with dependent When-Otherwise statements on the same column can OOM the driver Key: SPARK-24983 URL: https://issues.apache.org/jira/browse/SPARK-24983 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.3.1 Reporter: David Vogelbacher Hi, I noticed that writing a spark job that includes many sequential when-otherwise statements on the same column can easily OOM the driver while generating the optimized plan because the project node will grow exponentially in size. Example: {noformat} scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> val df = Seq("a", "b", "c", "1").toDF("text") df: org.apache.spark.sql.DataFrame = [text: string] scala> var dfCaseWhen = df.filter($"text" =!= lit("0")) dfCaseWhen: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string] scala> for( a <- 1 to 5) { | dfCaseWhen = dfCaseWhen.withColumn("text", when($"text" === lit(a.toString), lit("r" + a.toString)).otherwise($"text")) | } scala> dfCaseWhen.queryExecution.analyzed res6: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (text#12 = 5) THEN r5 ELSE text#12 END AS text#14] +- Project [CASE WHEN (text#10 = 4) THEN r4 ELSE text#10 END AS text#12] +- Project [CASE WHEN (text#8 = 3) THEN r3 ELSE text#8 END AS text#10] +- Project [CASE WHEN (text#6 = 2) THEN r2 ELSE text#6 END AS text#8] +- Project [CASE WHEN (text#3 = 1) THEN r1 ELSE text#3 END AS text#6] +- Filter NOT (text#3 = 0) +- Project [value#1 AS text#3] +- LocalRelation [value#1] scala> dfCaseWhen.queryExecution.optimizedPlan res5: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) 
THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END = 4) THEN r4 ELSE CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END = 3) THEN r3 ELSE CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END = 2) THEN r2 ELSE CASE WHEN (value#1 = 1) THEN r1 ELSE value#1 END END END END = 5) THEN r5 ELSE CASE WHEN (CASE WHEN (CASE WHEN (CASE WHEN (value#1 = 1) THEN r1 ELSE va... {noformat} As one can see the optimized plan grows exponentially in the number of {{when-otherwise}} statements here. I can see that this comes from the {{CollapseProject}} optimizer rule. Maybe we should put a limit on the resulting size of the project node after collapsing and only collapse if we stay under the limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
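The exponential blowup described in this issue can be reproduced outside Spark with a toy substitution model: collapsing a project inlines the child expression into every reference in the parent's CASE WHEN, so the expression roughly doubles in size at each step. The size guard below illustrates the fix the report proposes (only collapse when the result stays under a limit); it is a sketch, not Spark's actual rule.

```python
# Toy model of what CollapseProject does: inlining the child expression into
# the parent's CASE WHEN duplicates it (once in the condition, once in the
# ELSE branch), so five chained when/otherwise calls already yield an
# expression of size ~2^5 times the original.
def collapse(child_expr, step):
    # Parent is: CASE WHEN (text = step) THEN r<step> ELSE text END,
    # where "text" is replaced by the whole child expression -- twice.
    return (f"CASE WHEN ({child_expr} = {step}) THEN r{step} "
            f"ELSE {child_expr} END")

def collapse_with_limit(child_expr, step, limit):
    """Illustration of the proposed fix: only collapse under a size limit."""
    result = collapse(child_expr, step)
    return result if len(result) <= limit else None

expr = "value"
sizes = []
for step in range(1, 6):
    expr = collapse(expr, step)
    sizes.append(len(expr))

# Each collapse more than doubles the expression size, since the child
# appears twice in the template.
growth_is_exponential = all(b > 2 * a for a, b in zip(sizes, sizes[1:]))
```

With the limit in place, the optimizer would simply keep the stacked projects instead of materializing the exponential expression.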
[jira] [Assigned] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24982: Assignee: Reynold Xin (was: Apache Spark) > UDAF resolution should not throw java.lang.AssertionError > - > > Key: SPARK-24982 > URL: https://issues.apache.org/jira/browse/SPARK-24982 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > See udaf.sql.out: > > {code:java} > -- !query 3 > SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 > -- !query 3 schema > struct<> > -- !query 3 output > java.lang.AssertionError > assertion failed: Incorrect number of children1{code} > > We should never throw AssertionError unless there is a bug in the system > itself. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564364#comment-16564364 ] Apache Spark commented on SPARK-24982: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/21938 > UDAF resolution should not throw java.lang.AssertionError > - > > Key: SPARK-24982 > URL: https://issues.apache.org/jira/browse/SPARK-24982 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > See udaf.sql.out: > > {code:java} > -- !query 3 > SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 > -- !query 3 schema > struct<> > -- !query 3 output > java.lang.AssertionError > assertion failed: Incorrect number of children1{code} > > We should never throw AssertionError unless there is a bug in the system > itself. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24982: Assignee: Apache Spark (was: Reynold Xin) > UDAF resolution should not throw java.lang.AssertionError > - > > Key: SPARK-24982 > URL: https://issues.apache.org/jira/browse/SPARK-24982 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Major > > See udaf.sql.out: > > {code:java} > -- !query 3 > SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 > -- !query 3 schema > struct<> > -- !query 3 output > java.lang.AssertionError > assertion failed: Incorrect number of children1{code} > > We should never throw AssertionError unless there is a bug in the system > itself. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-24982: --- Assignee: Reynold Xin > UDAF resolution should not throw java.lang.AssertionError > - > > Key: SPARK-24982 > URL: https://issues.apache.org/jira/browse/SPARK-24982 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > See udaf.sql.out: > > {code:java} > -- !query 3 > SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 > -- !query 3 schema > struct<> > -- !query 3 output > java.lang.AssertionError > assertion failed: Incorrect number of children1{code} > > We should never throw AssertionError unless there is a bug in the system > itself. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
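The principle in this issue is that a wrong argument count during UDAF resolution is a *user* error, so it should surface as a user-facing analysis error with a helpful message, while assertions stay reserved for internal invariants. The sketch below illustrates that distinction only; `AnalysisException` here is a stand-in class, not Spark's, and the resolution logic is invented.

```python
# Illustration of the reported principle: raise a descriptive, user-facing
# error for an argument-count mismatch instead of tripping an internal
# assert ("assertion failed: Incorrect number of children").
class AnalysisException(Exception):
    """Stand-in for a user-facing analysis error."""

def resolve_udaf(name, expected_args, actual_args):
    if actual_args != expected_args:
        # A user error: report it in terms the user can act on.
        raise AnalysisException(
            f"Invalid number of arguments for function {name}: "
            f"expected {expected_args}, got {actual_args}")
    return f"{name}(resolved)"

try:
    # Mirrors the failing query: myDoubleAvg called with one extra argument.
    resolve_udaf("myDoubleAvg", expected_args=1, actual_args=2)
    outcome = "resolved"
except AnalysisException as e:
    outcome = str(e)
```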
[jira] [Comment Edited] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564362#comment-16564362 ] Ryan Blue edited comment on SPARK-24882 at 7/31/18 9:02 PM: {quote}the problem is then we need to make `CatalogSupport` a must-have for data sources instead of an optional plugin {quote} Data sources are read and write implementations. Catalog support should be a layer above read/write implementation that is used to provide CTAS and other table-level support. If you're interested in the anonymous table use case from the email discussion, I posted a suggestion there to add an {{anonymousTable}} function to {{DataSourceV2}}. That allows a source instantiated directly through v1-style reflection to provide a {{Table}} based on an options map. Then that table would implement {{ReadSupport}} and {{WriteSupport}} as I've suggested in this thread. That would preserve the ability to instantiate a source directly and use it, and would center around a {{Table}} that implements the read and write traits. An alternative to the {{anonymousTable}} method is what I did in the WIP pull request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: through the existing {{DataSourceV2Relation}} and through a new {{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement the read and write traits, while the latter is for {{Table}} objects that implement them. Either way works, though it would be cleaner to just use {{Table}}. Thanks for the builder update! Immutability is the most important part, but I'd still prefer a builder interface with default methods instead of the mix-in traits. was (Author: rdblue): {quote}the problem is then we need to make `CatalogSupport` a must-have for data sources instead of an optional plugin {quote} Data sources are read and write implementations. 
Catalog support should be a layer above read/write implementation that is used to provide CTAS and other table-level support. If you're interested in the anonymous table use case from the email discussion, I posted a suggestion there to add an {{anonymousTable}} function to {{DataSourceV2}}. That allows a source instantiated directly through v1-style reflection to provide a {{Table}} based on an options map. Then that table would implement {{ReadSupport}} and {{WriteSupport}} as I've suggested in this thread. That would preserve the ability to instantiate a source directly and use it, and would center around a {{Table}} that implements the read and write traits. An alternative to the {{anonymousTable}} method is what I did in the WIP pull request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: through the existing {{DataSourceV2Relation}} and through a new {{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement the read and write traits, while the latter is for {{Table}} objects that implement them. Either way works, though it would be cleaner to just use {{Table}}. Thanks for the builder update! Immutability is the most important part, but I'd still prefer a builder interface with default methods instead of the mix-in traits. > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. 
> To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateful part of the API, and think of better > naming for some interfaces. For details, please see the attached Google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564362#comment-16564362 ] Ryan Blue commented on SPARK-24882: --- {quote}the problem is then we need to make `CatalogSupport` a must-have for data sources instead of an optional plugin {quote} Data sources are read and write implementations. Catalog support should be a layer above read/write implementation that is used to provide CTAS and other table-level support. If you're interested in the anonymous table use case from the email discussion, I posted a suggestion there to add an {{anonymousTable}} function to {{DataSourceV2}}. That allows a source instantiated directly through v1-style reflection to provide a {{Table}} based on an options map. Then that table would implement {{ReadSupport}} and {{WriteSupport}} as I've suggested in this thread. That would preserve the ability to instantiate a source directly and use it, and would center around a {{Table}} that implements the read and write traits. An alternative to the {{anonymousTable}} method is what I did in the WIP pull request for CTAS. In that PR, I created two ways to work with {{DataSourceV2}}: through the existing {{DataSourceV2Relation}} and through a new {{TableV2Relation}}. The first is for {{DataSourceV2}} instances that implement the read and write traits, while the latter is for {{Table}} objects that implement them. Either way works, though it would be cleaner to just use {{Table}}. Thanks for the builder update! Immutability is the most important part, but I'd still prefer a builder interface with default methods instead of the mix-in traits. 
> data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 has been out for a while; see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateful part of the API, and think of better > naming for some interfaces. For details, please see the attached Google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
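The layering Ryan Blue describes can be sketched abstractly: the source object only resolves tables (including an anonymous table built from an options map), and the read/write traits hang off the {{Table}} rather than off the source itself. The shapes below are invented for illustration and are not Spark's actual interfaces; only the names ({{Table}}, {{ReadSupport}}, {{WriteSupport}}, {{anonymousTable}}) come from the discussion.

```python
# A sketch (not Spark's API) of the Table-centric layering discussed above:
# the source produces a Table from an options map, and read/write
# capabilities are traits implemented by the Table.
from abc import ABC, abstractmethod

class ReadSupport(ABC):
    @abstractmethod
    def scan(self):
        ...

class WriteSupport(ABC):
    @abstractmethod
    def write(self, rows):
        ...

class Table(ReadSupport, WriteSupport):
    """A concrete table carrying both read and write capabilities."""
    def __init__(self, options):
        self.options = options
        self._rows = []

    def scan(self):
        return list(self._rows)

    def write(self, rows):
        self._rows.extend(rows)

class DataSourceV2:
    """Source instantiated via v1-style reflection; it only resolves tables."""
    def anonymous_table(self, options):
        return Table(options)

table = DataSourceV2().anonymous_table({"path": "/tmp/example"})
table.write([{"id": 1}])
```

The point of the design is that a catalog could hand out the same `Table` type, so CTAS and direct instantiation share one read/write abstraction.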
[jira] [Assigned] (SPARK-24973) Add numIter to Python ClusteringSummary
[ https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24973: - Assignee: Huaxin Gao > Add numIter to Python ClusteringSummary > > > Key: SPARK-24973 > URL: https://issues.apache.org/jira/browse/SPARK-24973 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python > version of ClusteringSummary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18057: - Summary: Update structured streaming kafka from 0.10.0.1 to 2.0.0 (was: Update structured streaming kafka from 0.10.0.1 to 1.1.0) > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18057. -- Resolution: Fixed Assignee: Ted Yu Fix Version/s: 2.4.0 > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23914) High-order function: array_union(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564227#comment-16564227 ] Apache Spark commented on SPARK-23914: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21937 > High-order function: array_union(x, y) → array > -- > > Key: SPARK-23914 > URL: https://issues.apache.org/jira/browse/SPARK-23914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the union of x and y, without duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
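The documented semantics (per the Presto reference linked above) are simply "elements in the union of x and y, without duplicates." A reference-style sketch in Python; preserving first-seen order is an assumption here, since the SQL function only promises the set of elements:

```python
# Reference-style implementation of array_union semantics: the elements in
# the union of x and y, without duplicates. First-seen order is kept for
# determinism, which is an assumption beyond the documented contract.
def array_union(x, y):
    seen = set()
    out = []
    for elem in x + y:
        if elem not in seen:
            seen.add(elem)
            out.append(elem)
    return out
```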
[jira] [Updated] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-24982: Description: See udaf.sql.out: {code:java} -- !query 3 SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 -- !query 3 schema struct<> -- !query 3 output java.lang.AssertionError assertion failed: Incorrect number of children1{code} We should never throw AssertionError unless there is a bug in the system itself. was: See udaf.sql.out: -- !query 3 SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 -- !query 3 schema struct<> -- !query 3 output java.lang.AssertionError assertion failed: Incorrect number of children1 We should never throw AssertionError unless there is a bug in the system itself. > UDAF resolution should not throw java.lang.AssertionError > - > > Key: SPARK-24982 > URL: https://issues.apache.org/jira/browse/SPARK-24982 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Reynold Xin >Priority: Major > > See udaf.sql.out: > > {code:java} > -- !query 3 > SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 > -- !query 3 schema > struct<> > -- !query 3 output > java.lang.AssertionError > assertion failed: Incorrect number of children1{code} > > We should never throw AssertionError unless there is a bug in the system > itself. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24982) UDAF resolution should not throw java.lang.AssertionError
Reynold Xin created SPARK-24982: --- Summary: UDAF resolution should not throw java.lang.AssertionError Key: SPARK-24982 URL: https://issues.apache.org/jira/browse/SPARK-24982 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1, 2.3.0 Reporter: Reynold Xin See udaf.sql.out: -- !query 3 SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1 -- !query 3 schema struct<> -- !query 3 output java.lang.AssertionError assertion failed: Incorrect number of children1 We should never throw AssertionError unless there is a bug in the system itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program
[ https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564175#comment-16564175 ] Apache Spark commented on SPARK-24981: -- User 'hthuynh2' has created a pull request for this issue: https://github.com/apache/spark/pull/21936 > ShutdownHook timeout causes job to fail when succeeded when SparkContext > stop() not called by user program > -- > > Key: SPARK-24981 > URL: https://issues.apache.org/jira/browse/SPARK-24981 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Hieu Tri Huynh >Priority: Minor > > When the user does not stop the SparkContext at the end of their program, > ShutdownHookManager will stop it for them. However, each shutdown hook is > only given 10s to run, after which it is interrupted and cancelled. If > stopping the SparkContext takes longer than 10s, an InterruptedException is > thrown and the job fails even though it had already succeeded. An example of > this is shown below. > I think there are a few ways to fix this; below are the two ways that I have > now: > 1. After the user program finishes, we can check whether it stopped the > SparkContext. If it didn't, we can stop the SparkContext before finishing the > userThread. By doing so, SparkContext.stop() can take as much time as it > needs. > 2. We can catch the InterruptedException thrown by ShutdownHookManager while > we are stopping the SparkContext, and ignore whatever we haven't stopped > inside the SparkContext. Since we are shutting down, I think it is okay to > ignore those things. 
> > {code:java} > 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1 > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at > org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) > at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1921) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, > java.util.concurrent.TimeoutException > java.util.concurrent.TimeoutException > at
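Fix option 2 from the report (treat an interruption during shutdown as non-fatal) can be sketched as a wrapper around hook execution. The classes below are stand-ins for illustration, not Spark's actual ShutdownHookManager:

```python
# Sketch of fix option 2 above: while running shutdown hooks, catch the
# interruption raised when a hook exceeds its timeout, record it, and keep
# shutting down instead of letting it mark a succeeded job as failed.
class HookInterrupted(Exception):
    """Stand-in for the InterruptedException raised when a hook exceeds 10s."""

def run_shutdown_hooks(hooks):
    failures = []
    for hook in hooks:
        try:
            hook()
        except HookInterrupted as e:
            # Already shutting down: log and ignore rather than re-raise.
            failures.append(str(e))
    return failures

def slow_stop_spark_context():
    # Simulates SparkContext.stop() overrunning the shutdown-hook timeout.
    raise HookInterrupted("stop() exceeded the 10s hook timeout")

log = run_shutdown_hooks([slow_stop_spark_context, lambda: None])
```

Option 1 (stopping the SparkContext from the user thread before the hooks ever run) avoids the timeout entirely and is arguably the cleaner fix.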
[jira] [Assigned] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program
[ https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24981: Assignee: Apache Spark > ShutdownHook timeout causes job to fail when succeeded when SparkContext > stop() not called by user program > -- > > Key: SPARK-24981 > URL: https://issues.apache.org/jira/browse/SPARK-24981 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Hieu Tri Huynh >Assignee: Apache Spark >Priority: Minor > > When the user does not stop the SparkContext at the end of their program, > ShutdownHookManager will stop it for them. However, each shutdown hook is > only given 10s to run, after which it is interrupted and cancelled. If > stopping the SparkContext takes longer than 10s, an InterruptedException is > thrown and the job fails even though it had already succeeded. An example of > this is shown below. > I think there are a few ways to fix this; below are the two ways that I have > now: > 1. After the user program finishes, we can check whether it stopped the > SparkContext. If it didn't, we can stop the SparkContext before finishing the > userThread. By doing so, SparkContext.stop() can take as much time as it > needs. > 2. We can catch the InterruptedException thrown by ShutdownHookManager while > we are stopping the SparkContext, and ignore whatever we haven't stopped > inside the SparkContext. Since we are shutting down, I think it is okay to > ignore those things. 
> > {code:java} > 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1 > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at > org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) > at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1921) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, > java.util.concurrent.TimeoutException > java.util.concurrent.TimeoutException > at java.util.concurrent.FutureTask.get(FutureTask.java:205) > at >
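The failure mode in the log above (a hook that outlives its per-hook budget and is reported as a TimeoutException) can be modeled in a few lines. This is a hedged sketch, not Spark's actual ShutdownHookManager: the helper name is invented, and the budget is shortened from Spark's 10s so the demo runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as HookTimeout

HOOK_BUDGET_S = 0.2  # Spark allows 10s per shutdown hook; shortened for the demo


def run_hook_with_timeout(hook, budget=HOOK_BUDGET_S):
    """Run one shutdown hook, giving up and reporting failure after `budget`.

    Mirrors (in miniature) how a hook manager abandons a hook that runs
    too long, which is what turns a successful job into a reported failure.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(hook)
        try:
            future.result(timeout=budget)
            return "ok"
        except HookTimeout:
            # Analogous to the TimeoutException in the log: the hook is
            # abandoned even though the work it was doing may be harmless.
            future.cancel()
            return "timeout"


fast_stop = lambda: None              # stop() finishes within the budget
slow_stop = lambda: time.sleep(1.0)   # stop() takes longer than the budget

print(run_hook_with_timeout(fast_stop))  # ok
print(run_hook_with_timeout(slow_stop))  # timeout
```

Option 1 from the description sidesteps this entirely: stopping the SparkContext in the user thread, before any shutdown hook fires, means the stop is not subject to the per-hook budget at all.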
[jira] [Assigned] (SPARK-24981) ShutdownHook timeout causes job to fail after succeeding when SparkContext stop() is not called by user program
[ https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24981: Assignee: (was: Apache Spark) > ShutdownHook timeout causes job to fail after succeeding when SparkContext > stop() is not called by user program > -- > > Key: SPARK-24981 > URL: https://issues.apache.org/jira/browse/SPARK-24981 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Hieu Tri Huynh >Priority: Minor > > When the user program does not stop the SparkContext at the end, > ShutdownHookManager stops it instead. However, each shutdown hook is only > given 10s to run; it is interrupted and cancelled after that time. If stopping > the SparkContext takes longer than 10s, an InterruptedException is thrown and > the job is reported as failed even though it had already succeeded. An example > of this is shown below. > I think there are a few ways to fix this; these are the two I have now: > 1. After the user program finishes, check whether it stopped the > SparkContext. If it did not, stop the SparkContext before finishing the > userThread. That way SparkContext.stop() can take as much time as it needs. > 2. Catch the InterruptedException thrown by ShutdownHookManager while we are > stopping the SparkContext, and ignore whatever has not yet been stopped > inside it. Since we are shutting down, it should be okay to ignore those > things. 
> > {code:java} > 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1 > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at > org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) > at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1921) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, > java.util.concurrent.TimeoutException > java.util.concurrent.TimeoutException > at java.util.concurrent.FutureTask.get(FutureTask.java:205) > at >
[jira] [Commented] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564172#comment-16564172 ] Stavros Kontopoulos commented on SPARK-14643: - Cool thanks! > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Task > Components: Build, Project Infra >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is built against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148 ] Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:38 PM: -- [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this for now, and as [~lrytz] mentions in the doc: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." What is implemented is what is described in the action plan in the document. I thought we reached an agreement there (as I see from the comments). If we were going to implement this, there are a couple of options mentioned in the doc: * A change to the Scala compiler, which we can do in 2.12.7 ([prototype here|https://github.com/scala/scala/compare/2.12.x...lrytz:synthetic?expand=1]). The Scala 2.11 compiler does not need to change. * A Scala compiler plugin could do the same for existing compiler releases * A post-processing step in the build could modify the class files using ASM The latter does not depend on the Scala compiler, but I don't know the implications. was (Author: skonto): [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this for now, and as [~lrytz] mentions in the doc: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." What is implemented is what is described in the action plan in the document. 
> Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148 ] Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:38 PM: -- [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this for now, and as [~lrytz] mentions in the doc: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." What is implemented is what is described in the action plan in the document. I thought we reached an agreement there (as I see from the comments). If we were going to implement this, there are a couple of options mentioned in the doc: * A change to the Scala compiler, which we can do in 2.12.7 ([prototype here|https://github.com/scala/scala/compare/2.12.x...lrytz:synthetic?expand=1]). The Scala 2.11 compiler does not need to change. * A Scala compiler plugin could do the same for existing compiler releases * A post-processing step in the build could modify the class files using ASM The latter does not depend on the Scala compiler, but I don't know the implications. was (Author: skonto): [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this for now, and as [~lrytz] mentions in the doc: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." What is implemented is what is described in the action plan in the document. 
I though we reached an agreement there (as I see from the comments). If we were going to implement this there a couple of options mentioned in the doc: * A change to the Scala compiler, which we can do in 2.12.7 ([prototype here|https://github.com/scala/scala/compare/2.12.x...lrytz:synthetic?expand=1]). The Scala 2.11 compiler does not need to change. * A Scala compiler plugin could do the same for existing compiler releases * A post-processing step in the build could modify the class files using ASM The latter does not depend on the scala compiler but dont know the implications. > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is built against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). 
For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24609) PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well
[ https://issues.apache.org/jira/browse/SPARK-24609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24609. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21788 [https://github.com/apache/spark/pull/21788] > PySpark/SparkR doc doesn't explain > RandomForestClassifier.featureSubsetStrategy well > > > Key: SPARK-24609 > URL: https://issues.apache.org/jira/browse/SPARK-24609 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Xiangrui Meng >Assignee: zhengruifeng >Priority: Major > Fix For: 2.4.0 > > > In Scala doc > ([https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier)], > we have: > > {quote}The number of features to consider for splits at each tree node. > Supported options: > * "auto": Choose automatically for task: If numTrees == 1, set to "all." If > numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for > regression. > * "all": use all features > * "onethird": use 1/3 of the features > * "sqrt": use sqrt(number of features) > * "log2": use log2(number of features) > * "n": when n is in the range (0, 1.0], use n * number of features. When n > is in the range (1, number of features), use n features. (default = "auto") > These various settings are based on the following references: > * log2: tested in Breiman (2001) > * sqrt: recommended by Breiman manual for random forests > * The defaults of sqrt (classification) and onethird (regression) match the > R randomForest package.{quote} > > The entire paragraph is missing in PySpark doc > ([https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy]). > And same issue for SparkR > (https://github.com/apache/spark/blob/master/R/pkg/R/mllib_tree.R#L365). 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
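The featureSubsetStrategy options quoted above reduce to a small amount of arithmetic. The helper below is an illustrative sketch, not a Spark API: the function name is mine, and the exact rounding (ceiling) is an assumption made for the demo.

```python
import math


def num_features_to_sample(strategy, total, num_trees=1, classification=True):
    """Resolve a featureSubsetStrategy string to a feature count,
    following the option list quoted from the Scala doc."""
    if strategy == "auto":
        # If numTrees == 1, use all features; otherwise sqrt for
        # classification and one third for regression.
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if classification else "onethird"
    if strategy == "all":
        return total
    if strategy == "onethird":
        return max(1, math.ceil(total / 3.0))
    if strategy == "sqrt":
        return max(1, math.ceil(math.sqrt(total)))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(total)))
    # "n": a fraction of the features in (0, 1.0], or an absolute count.
    n = float(strategy)
    if 0 < n <= 1:
        return max(1, math.ceil(n * total))
    return int(n)


print(num_features_to_sample("sqrt", 100))  # 10
print(num_features_to_sample("auto", 99, num_trees=20, classification=False))  # 33
```

A short paragraph along these lines in the PySpark and SparkR docstrings would close the gap with the Scala doc.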
[jira] [Updated] (SPARK-24609) PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well
[ https://issues.apache.org/jira/browse/SPARK-24609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24609: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > PySpark/SparkR doc doesn't explain > RandomForestClassifier.featureSubsetStrategy well > > > Key: SPARK-24609 > URL: https://issues.apache.org/jira/browse/SPARK-24609 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Xiangrui Meng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.4.0 > > > In Scala doc > ([https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier)], > we have: > > {quote}The number of features to consider for splits at each tree node. > Supported options: > * "auto": Choose automatically for task: If numTrees == 1, set to "all." If > numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for > regression. > * "all": use all features > * "onethird": use 1/3 of the features > * "sqrt": use sqrt(number of features) > * "log2": use log2(number of features) > * "n": when n is in the range (0, 1.0], use n * number of features. When n > is in the range (1, number of features), use n features. (default = "auto") > These various settings are based on the following references: > * log2: tested in Breiman (2001) > * sqrt: recommended by Breiman manual for random forests > * The defaults of sqrt (classification) and onethird (regression) match the > R randomForest package.{quote} > > The entire paragraph is missing in PySpark doc > ([https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy]). > And same issue for SparkR > (https://github.com/apache/spark/blob/master/R/pkg/R/mllib_tree.R#L365). 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24609) PySpark/SparkR doc doesn't explain RandomForestClassifier.featureSubsetStrategy well
[ https://issues.apache.org/jira/browse/SPARK-24609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24609: - Assignee: zhengruifeng > PySpark/SparkR doc doesn't explain > RandomForestClassifier.featureSubsetStrategy well > > > Key: SPARK-24609 > URL: https://issues.apache.org/jira/browse/SPARK-24609 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Xiangrui Meng >Assignee: zhengruifeng >Priority: Major > Fix For: 2.4.0 > > > In Scala doc > ([https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier)], > we have: > > {quote}The number of features to consider for splits at each tree node. > Supported options: > * "auto": Choose automatically for task: If numTrees == 1, set to "all." If > numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for > regression. > * "all": use all features > * "onethird": use 1/3 of the features > * "sqrt": use sqrt(number of features) > * "log2": use log2(number of features) > * "n": when n is in the range (0, 1.0], use n * number of features. When n > is in the range (1, number of features), use n features. (default = "auto") > These various settings are based on the following references: > * log2: tested in Breiman (2001) > * sqrt: recommended by Breiman manual for random forests > * The defaults of sqrt (classification) and onethird (regression) match the > R randomForest package.{quote} > > The entire paragraph is missing in PySpark doc > ([https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy]). > And same issue for SparkR > (https://github.com/apache/spark/blob/master/R/pkg/R/mllib_tree.R#L365). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24980) add support for pandas/arrow etc for python2.7 and pypy builds
[ https://issues.apache.org/jira/browse/SPARK-24980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564154#comment-16564154 ] shane knapp commented on SPARK-24980: - [~hyukjin.kwon] thought you might be interested in this! ;) > add support for pandas/arrow etc for python2.7 and pypy builds > -- > > Key: SPARK-24980 > URL: https://issues.apache.org/jira/browse/SPARK-24980 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > since we have full support for python3.4 via anaconda, it's time to create > similar environments for 2.7 and pypy 2.5.1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148 ] Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:34 PM: -- [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this for now, and as [~lrytz] mentions in the doc: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." What is implemented is what is described in the action plan in the document. was (Author: skonto): [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this, and as [~lrytz] mentions in the doc: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." What is implemented is what is described in the action plan in the document. > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). 
> This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148 ] Stavros Kontopoulos edited comment on SPARK-14643 at 7/31/18 6:34 PM: -- [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this, and as [~lrytz] mentions in the doc: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." What is implemented is what is described in the action plan in the document. was (Author: skonto): [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this, as [~lrytz] said: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). 
> This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564148#comment-16564148 ] Stavros Kontopoulos commented on SPARK-14643: - [~srowen] Right now there is no such change in the implementation. As noted in the description of the PR, we are not going to fix this, as [~lrytz] said: "The proposed plan is to NOT do the `ACC_SYNTHETIC` for now. Java users would continue to use the 2.11 artifacts. If 2.11 support is ever dropped, I assume this would happen in a major release that is also allowed to break binary compatibility, so then the conflicting overload could be removed." > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is build against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. 
> We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
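For readers unfamiliar with the ambiguity above, the sketch below mirrors it with two hypothetical functional interfaces standing in for Scala 2.12's `Function1` and Spark's Java-friendly `MapFunction`. The names `OverloadDemo`, `ScalaFn`, and `JavaFn` are illustrative stand-ins, not Spark's API. Once both overload targets are SAM types, a bare lambda matches both and no longer resolves; callers must disambiguate with a cast, which is the source-compatibility cost the writeup's implicit-conversion proposal tries to avoid:

```java
public class OverloadDemo {
    // Two SAM interfaces standing in for scala.Function1 and Spark's MapFunction.
    interface ScalaFn<T, R> { R apply(T t); }
    interface JavaFn<T, R> { R call(T t); }

    // Overloads shaped like Dataset.map(Function1) and Dataset.map(MapFunction).
    static <R> R map(int v, ScalaFn<Integer, R> f) { return f.apply(v); }
    static <R> R map(int v, JavaFn<Integer, R> f) { return f.call(v); }

    public static void main(String[] args) {
        // map(1, x -> x + 1) fails to compile: both overloads accept the lambda
        // and neither is more specific. An explicit cast picks one overload.
        int r = map(1, (JavaFn<Integer, Integer>) x -> x + 1);
        System.out.println(r); // prints 2
    }
}
```

Under Scala 2.11 the `Function1` overload was not a lambda target from Java, so the call was unambiguous; 2.12's SAM treatment is what introduces the clash.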
[jira] [Commented] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions
[ https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564145#comment-16564145 ] Apache Spark commented on SPARK-24773: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21935 > support reading AVRO logical types - Timestamp with different precisions > > > Key: SPARK-24773 > URL: https://issues.apache.org/jira/browse/SPARK-24773 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
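As a sketch of what "different precisions" means here: Avro's `timestamp-millis` and `timestamp-micros` logical types both encode a long counted from the Unix epoch, and a reader must scale by the right factor when converting to a common representation. The JDK-only conversion below illustrates the mapping; it is not Spark's actual Avro reader:

```java
import java.time.Instant;

public class AvroTimestampDemo {
    // Avro timestamp-millis: long = milliseconds since the epoch.
    static Instant fromMillis(long v) { return Instant.ofEpochMilli(v); }

    // Avro timestamp-micros: long = microseconds since the epoch; split into
    // whole seconds plus leftover nanoseconds (floorDiv/floorMod handle
    // pre-epoch negatives correctly).
    static Instant fromMicros(long v) {
        long secs = Math.floorDiv(v, 1_000_000L);
        long nanos = Math.floorMod(v, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(secs, nanos);
    }

    public static void main(String[] args) {
        // The same wall-clock instant encoded at two precisions.
        Instant a = fromMillis(1_533_081_600_000L);
        Instant b = fromMicros(1_533_081_600_000_000L);
        System.out.println(a.equals(b)); // prints true
    }
}
```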
[jira] [Assigned] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions
[ https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24773: Assignee: (was: Apache Spark) > support reading AVRO logical types - Timestamp with different precisions > > > Key: SPARK-24773 > URL: https://issues.apache.org/jira/browse/SPARK-24773 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions
[ https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24773: Assignee: Apache Spark > support reading AVRO logical types - Timestamp with different precisions > > > Key: SPARK-24773 > URL: https://issues.apache.org/jira/browse/SPARK-24773 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24973) Add numIter to Python ClusteringSummary
[ https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24973. --- Resolution: Duplicate I agree, though I think this should be thought of as one big issue. > Add numIter to Python ClusteringSummary > > > Key: SPARK-24973 > URL: https://issues.apache.org/jira/browse/SPARK-24973 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Minor > > -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python > version of ClusteringSummary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program
[ https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hieu Tri Huynh updated SPARK-24981: --- Priority: Minor (was: Major) > ShutdownHook timeout causes job to fail when succeeded when SparkContext > stop() not called by user program > -- > > Key: SPARK-24981 > URL: https://issues.apache.org/jira/browse/SPARK-24981 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Hieu Tri Huynh >Priority: Minor > > When the user does not stop the SparkContext at the end of their program, > ShutdownHookManager stops it instead. However, each shutdown hook is only > given 10s to run; it is interrupted and cancelled after that. If stopping the > SparkContext takes longer than 10s, an InterruptedException is thrown and the > job fails even though it had already succeeded. An example is shown below. > There are a few ways to fix this; here are the two I have now: > 1. After the user program finishes, check whether it stopped the > SparkContext. If it didn't, stop the SparkContext before finishing the user > thread. That way SparkContext.stop() can take as much time as it needs. > 2. Catch the InterruptedException thrown by ShutdownHookManager while > stopping the SparkContext and ignore whatever hasn't been stopped inside it. > Since we are shutting down, it should be okay to ignore those things. 
> > {code:java} > 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1 > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at > org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) > at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1921) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, > java.util.concurrent.TimeoutException > java.util.concurrent.TimeoutException > at java.util.concurrent.FutureTask.get(FutureTask.java:205) > at >
[jira] [Updated] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program
[ https://issues.apache.org/jira/browse/SPARK-24981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hieu Tri Huynh updated SPARK-24981: --- Summary: ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called by user program (was: ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called) > ShutdownHook timeout causes job to fail when succeeded when SparkContext > stop() not called by user program > -- > > Key: SPARK-24981 > URL: https://issues.apache.org/jira/browse/SPARK-24981 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Hieu Tri Huynh >Priority: Major > > When the user does not stop the SparkContext at the end of their program, > ShutdownHookManager stops it instead. However, each shutdown hook is only > given 10s to run; it is interrupted and cancelled after that. If stopping the > SparkContext takes longer than 10s, an InterruptedException is thrown and the > job fails even though it had already succeeded. An example is shown below. > There are a few ways to fix this; here are the two I have now: > 1. After the user program finishes, check whether it stopped the > SparkContext. If it didn't, stop the SparkContext before finishing the user > thread. That way SparkContext.stop() can take as much time as it needs. > 2. Catch the InterruptedException thrown by ShutdownHookManager while > stopping the SparkContext and ignore whatever hasn't been stopped inside it. > Since we are shutting down, it should be okay to ignore those things. 
> > {code:java} > 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1 > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at > org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) > at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1921) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, >
[jira] [Created] (SPARK-24981) ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called
Hieu Tri Huynh created SPARK-24981: -- Summary: ShutdownHook timeout causes job to fail when succeeded when SparkContext stop() not called Key: SPARK-24981 URL: https://issues.apache.org/jira/browse/SPARK-24981 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: Hieu Tri Huynh When the user does not stop the SparkContext at the end of their program, ShutdownHookManager stops it instead. However, each shutdown hook is only given 10s to run; it is interrupted and cancelled after that. If stopping the SparkContext takes longer than 10s, an InterruptedException is thrown and the job fails even though it had already succeeded. An example is shown below. There are a few ways to fix this; here are the two I have now: 1. After the user program finishes, check whether it stopped the SparkContext. If it didn't, stop the SparkContext before finishing the user thread. That way SparkContext.stop() can take as much time as it needs. 2. Catch the InterruptedException thrown by ShutdownHookManager while stopping the SparkContext and ignore whatever hasn't been stopped inside it. Since we are shutting down, it should be okay to ignore those things. 
{code:java} 18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1 java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1249) at java.lang.Thread.join(Thread.java:1323) at org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136) at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360) at org.apache.spark.SparkContext.stop(SparkContext.scala:1921) at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException java.util.concurrent.TimeoutException at java.util.concurrent.FutureTask.get(FutureTask.java:205) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
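The 10-second budget described above comes from the shutdown-hook manager running each hook in a future and cancelling it on timeout. The self-contained sketch below reproduces that pattern and the reporter's second proposed fix (catch the timeout/interrupt and carry on instead of failing the job). It uses only the JDK; `runHookWithTimeout` and the timings are illustrative, not Spark's or Hadoop's actual code:

```java
import java.util.concurrent.*;

public class HookTimeoutDemo {
    // Run a shutdown hook with a time budget, as the hook manager does; instead
    // of letting the timeout fail the job, report failure and move on.
    static boolean runHookWithTimeout(Runnable hook, long timeoutMs) {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        Future<?> f = ex.submit(hook);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true); // interrupts the hook, as in the stack trace above
            return false;   // swallowed: we are shutting down anyway
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean fast = runHookWithTimeout(() -> {}, 1000);
        boolean slow = runHookWithTimeout(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException ignored) {}
        }, 100);
        System.out.println(fast + " " + slow); // prints "true false"
    }
}
```

The first proposed fix avoids the race entirely by stopping the SparkContext on the user thread, where no hook budget applies.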
[jira] [Updated] (SPARK-24287) Spark -packages option should support classifier, no-transitive, and custom conf
[ https://issues.apache.org/jira/browse/SPARK-24287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24287: -- Issue Type: Improvement (was: Bug) This covers a lot of the same ground as https://issues.apache.org/jira/browse/SPARK-20075 > Spark -packages option should support classifier, no-transitive, and custom > conf > > > Key: SPARK-24287 > URL: https://issues.apache.org/jira/browse/SPARK-24287 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > We should extend Spark's -packages option to support: > # Turning off transitive dependencies for a given artifact (like spark-avro) > # Resolving a given artifact with a classifier (like avro-mapred-1.7.4-h2.jar) > # Resolving a given artifact with a custom ivy conf > # Excluding particular transitive dependencies of a given artifact. We > currently only have a top-level exclusion rule that applies to all artifacts. > We have tested this patch internally and it greatly increases flexibility > when using the -packages option -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564075#comment-16564075 ] Apache Spark commented on SPARK-24951: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/21934 > Table valued functions should throw AnalysisException instead of > IllegalArgumentException > - > > Key: SPARK-24951 > URL: https://issues.apache.org/jira/browse/SPARK-24951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > When arguments don't match, TVFs currently throw IllegalArgumentException, > inconsistent with other functions, which throw AnalysisException. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24951: Assignee: Reynold Xin (was: Apache Spark) > Table valued functions should throw AnalysisException instead of > IllegalArgumentException > - > > Key: SPARK-24951 > URL: https://issues.apache.org/jira/browse/SPARK-24951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > When arguments don't match, TVFs currently throw IllegalArgumentException, > inconsistent with other functions, which throw AnalysisException. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24951) Table valued functions should throw AnalysisException instead of IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-24951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24951: Assignee: Apache Spark (was: Reynold Xin) > Table valued functions should throw AnalysisException instead of > IllegalArgumentException > - > > Key: SPARK-24951 > URL: https://issues.apache.org/jira/browse/SPARK-24951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Major > > When arguments don't match, TVFs currently throw IllegalArgumentException, > inconsistent with other functions, which throw AnalysisException. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
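The change being tracked here is essentially exception translation: argument validation in a table-valued function should surface through the SQL-facing exception type. A minimal sketch of that pattern follows; the `AnalysisException` class is a stand-in for Spark's, and `range` mimics a table-valued function validating its arguments:

```java
public class TvfDemo {
    // Stand-in for Spark's user-facing AnalysisException.
    static class AnalysisException extends RuntimeException {
        AnalysisException(String msg) { super(msg); }
    }

    // A TVF-like entry point: argument mismatches surface as AnalysisException,
    // consistent with other SQL functions, rather than as a bare
    // IllegalArgumentException leaking from internal validation.
    static long[] range(long start, long end) {
        if (end < start) {
            throw new AnalysisException(
                "Invalid arguments for function range: end < start");
        }
        long[] out = new long[(int) (end - start)];
        for (int i = 0; i < out.length; i++) out[i] = start + i;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(range(0, 3).length); // prints 3
        try {
            range(3, 0);
        } catch (AnalysisException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```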
[jira] [Updated] (SPARK-24971) remove SupportsDeprecatedScanRow
[ https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24971: Issue Type: Sub-task (was: Improvement) Parent: SPARK-22386 > remove SupportsDeprecatedScanRow > > > Key: SPARK-24971 > URL: https://issues.apache.org/jira/browse/SPARK-24971 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache holds every FileStatus regardless of whether you specify partition columns. This causes long load times for queries that use only a couple of partitions, because Spark loads file paths from all partitions. I partitioned files by *report_date* and *type*, so I have a directory structure like {code:java} /custom_path/report_date=2018-07-24/type=A/file_1.parquet {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} My query needs to load only files of type *A*, which is just a couple of files, but Spark loads all 19K files from all partitions into SharedInMemoryCache, which takes about 60 secs, and only after that discards the unused partitions. This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] was: SharedInMemoryCache holds every FileStatus regardless of whether you specify partition columns. This causes long load times for queries that use only a couple of partitions, because Spark loads file paths from all partitions. I partitioned files by *report_date* and *type*, so I have a directory structure like {code:java} /custom_path/report_date=2018-07-24/type=A/file_1.parquet {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} My query needs to load only files of type *A*, which is just a couple of files, but Spark loads all 19K files from all partitions into SharedInMemoryCache, which takes about 60 secs, and only after that discards the unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. 
> --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache holds every FileStatus regardless of whether you specify > partition columns. This causes long load times for queries that use only a > couple of partitions, because Spark loads file paths from all partitions. > I partitioned files by *report_date* and *type*, so I have a directory > structure like > {code:java} > /custom_path/report_date=2018-07-24/type=A/file_1.parquet > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > My query needs to load only files of type *A*, which is just a couple of > files, but Spark loads all 19K files from all partitions into > SharedInMemoryCache, which takes about 60 secs, and only after that discards > the unused partitions. > > This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
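What the reporter expects is that the `type == 'A'` predicate prunes partition directories before any file listing or caching happens. A toy version of that Hive-style partition-path pruning (pure JDK string handling, not Spark's InMemoryFileIndex):

```java
import java.util.*;
import java.util.stream.*;

public class PartitionPruneDemo {
    // Extract partition key=value pairs from a Hive-style path such as
    // /custom_path/report_date=2018-07-24/type=A/file_1.parquet
    static Map<String, String> partitionValues(String path) {
        Map<String, String> kv = new LinkedHashMap<>();
        for (String seg : path.split("/")) {
            int eq = seg.indexOf('=');
            if (eq > 0) kv.put(seg.substring(0, eq), seg.substring(eq + 1));
        }
        return kv;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "/custom_path/report_date=2018-07-24/type=A/file_1.parquet",
            "/custom_path/report_date=2018-07-24/type=B/file_2.parquet",
            "/custom_path/report_date=2018-07-23/type=A/file_3.parquet");

        // Prune by partition value before touching (or caching) any FileStatus:
        // only type=A paths under the requested report_date survive.
        List<String> pruned = paths.stream()
            .filter(p -> "2018-07-24".equals(partitionValues(p).get("report_date")))
            .filter(p -> "A".equals(partitionValues(p).get("type")))
            .collect(Collectors.toList());
        System.out.println(pruned.size()); // prints 1
    }
}
```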
[jira] [Created] (SPARK-24980) add support for pandas/arrow etc for python2.7 and pypy builds
shane knapp created SPARK-24980: --- Summary: add support for pandas/arrow etc for python2.7 and pypy builds Key: SPARK-24980 URL: https://issues.apache.org/jira/browse/SPARK-24980 Project: Spark Issue Type: Improvement Components: Build, PySpark Affects Versions: 2.3.1 Reporter: shane knapp Assignee: shane knapp since we have full support for python3.4 via anaconda, it's time to create similar environments for 2.7 and pypy 2.5.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563998#comment-16563998 ] Sean Owen commented on SPARK-14643: --- [~skonto] is the conclusion that there's a change involving the ACC_SYNTHETIC flag that will resolve this issue? Seems OK to implement, to me, if it works. That would be the last remaining item open for 2.12 support. > Remove overloaded methods which become ambiguous in Scala 2.12 > -- > > Key: SPARK-14643 > URL: https://issues.apache.org/jira/browse/SPARK-14643 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > > Spark 1.x's Dataset API runs into subtle source incompatibility problems for > Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a > nutshell, the current API has overloaded methods whose signatures are > ambiguous when resolving calls that use the Java 8 lambda syntax (only if > Spark is built against Scala 2.12). > This issue is somewhat subtle, so there's a full writeup at > https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing > which describes the exact circumstances under which the current APIs are > problematic. The writeup also proposes a solution which involves the removal > of certain overloads only in Scala 2.12 builds of Spark and the introduction > of implicit conversions for retaining source compatibility. > We don't need to implement any of these changes until we add Scala 2.12 > support since the changes must only be applied when building against Scala > 2.12 and will be done via traits + shims which are mixed in via > per-Scala-version source directories (like how we handle the > Scala-version-specific parts of the REPL). For now, this JIRA acts as a > placeholder so that the parent JIRA reflects the complete set of tasks which > need to be finished for 2.12 support. 
[jira] [Updated] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14540: -- Labels: release-notes (was: ) > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Josh Rosen >Priority: Major > Labels: release-notes > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown. (ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) 
> [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info]- field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. > (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24917) Sending a partition over netty results in 2x memory usage
[ https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24917: Assignee: (was: Apache Spark) > Sending a partition over netty results in 2x memory usage > - > > Key: SPARK-24917 > URL: https://issues.apache.org/jira/browse/SPARK-24917 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: Vincent >Priority: Major > > Hello > while investigating some OOM errors in Spark 2.2 [(here's my call > stack)|https://image.ibb.co/hHa2R8/sparkOOM.png], I find the following > behavior happening, which I think is weird: > * a request happens to send a partition over the network > * this partition is 1.9 GB and is persisted in memory > * this partition is apparently stored in a ByteBufferBlockData, that is made > of a ChunkedByteBuffer, which is a list of (lots of) ByteBuffers of 4 MB each > * the call to toNetty() is supposed to only wrap all the arrays and not > allocate any memory > * yet the call stack shows that netty is allocating memory and is trying to > consolidate all the chunks into one big 1.9GB array > * this means that at this point the memory footprint is 2x the size of the > actual partition (which is huge when the partition is 1.9GB) > Is this transient allocation expected? > After digging, it turns out that the actual copy is due to [this > method|https://github.com/netty/netty/blob/4.0/buffer/src/main/java/io/netty/buffer/Unpooled.java#L260] > in netty. If my initial buffer is made of more than DEFAULT_MAX_COMPONENTS > (16) components, it will trigger a re-allocation of the whole buffer. This netty > issue was fixed in this recent change: > [https://github.com/netty/netty/commit/9b95b8ee628983e3e4434da93fffb893edff4aa2] > > As a result, is it possible to benefit from this change somehow in spark 2.2 > and above? 
I don't know how the netty dependencies are handled for spark > > NB: it seems this ticket: [https://jira.apache.org/jira/browse/SPARK-24307] > changed the approach for spark 2.4 by bypassing the netty buffer > altogether. However, as written in the ticket, this approach *still* > needs to have the *entire* block serialized in memory, so this would be a > downgrade from fixing the netty issue when your buffer is < 2GB > > Thanks!
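The transient doubling described in this ticket can be illustrated with a short, framework-free toy. This is plain Python, not Spark or Netty code: `MAX_COMPONENTS` stands in for Netty's `DEFAULT_MAX_COMPONENTS`, and the chunk sizes are illustrative.

```python
MAX_COMPONENTS = 16  # analogous to Netty's DEFAULT_MAX_COMPONENTS threshold

def wrap_chunks(chunks):
    """Zero-copy path: keep references to the existing chunks."""
    return list(chunks)

def consolidate(chunks):
    """Copying path: allocates one buffer as large as all chunks combined,
    so peak memory is roughly 2x the data size while both copies exist."""
    return b"".join(chunks)

# A "partition" made of many small chunks, like a ChunkedByteBuffer.
chunks = [bytes(4) for _ in range(32)]

if len(chunks) > MAX_COMPONENTS:
    combined = consolidate(chunks)  # the surprising extra allocation
else:
    combined = wrap_chunks(chunks)  # what toNetty() was expected to do
```

With 32 chunks the threshold is exceeded, so the copying path runs, which mirrors why a 1.9 GB partition briefly needs ~3.8 GB.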
[jira] [Assigned] (SPARK-24917) Sending a partition over netty results in 2x memory usage
[ https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24917: Assignee: Apache Spark
[jira] [Commented] (SPARK-24917) Sending a partition over netty results in 2x memory usage
[ https://issues.apache.org/jira/browse/SPARK-24917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563918#comment-16563918 ] Apache Spark commented on SPARK-24917: -- User 'vincent-grosbois' has created a pull request for this issue: https://github.com/apache/spark/pull/21933
[jira] [Assigned] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp
[ https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24979: Assignee: Apache Spark (was: Wenchen Fan) > add AnalysisHelper#resolveOperatorsUp > - > > Key: SPARK-24979 > URL: https://issues.apache.org/jira/browse/SPARK-24979 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp
[ https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563896#comment-16563896 ] Apache Spark commented on SPARK-24979: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21932
[jira] [Assigned] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp
[ https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24979: Assignee: Wenchen Fan (was: Apache Spark)
[jira] [Updated] (SPARK-24979) add AnalysisHelper#resolveOperatorsUp
[ https://issues.apache.org/jira/browse/SPARK-24979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24979: Summary: add AnalysisHelper#resolveOperatorsUp (was: add resolveOperatorsUp)
[jira] [Created] (SPARK-24979) add resolveOperatorsUp
Wenchen Fan created SPARK-24979: --- Summary: add resolveOperatorsUp Key: SPARK-24979 URL: https://issues.apache.org/jira/browse/SPARK-24979 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Resolved] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError
[ https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24536. - Resolution: Fixed Fix Version/s: 2.4.0 2.3.2 > Query with nonsensical LIMIT hits AssertionError > > > Key: SPARK-24536 > URL: https://issues.apache.org/jira/browse/SPARK-24536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Alexander Behm >Priority: Trivial > Labels: beginner, spree > Fix For: 2.3.2, 2.4.0 > > > SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT) > fails in the QueryPlanner with: > {code} > java.lang.AssertionError: assertion failed: No plan for GlobalLimit null > {code} > I think this issue should be caught earlier during semantic analysis.
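The kind of semantic-analysis check the reporter asks for can be sketched as follows. This is a hedged illustration, not Spark's actual CheckAnalysis code; the function name and error messages are invented for the example.

```python
def check_limit(limit_value):
    """Validate a LIMIT expression's evaluated value before planning,
    so a null or non-integer limit fails with a clear analysis error
    instead of an AssertionError deep in the query planner."""
    if limit_value is None:
        raise ValueError("the limit expression must not be null")
    if not isinstance(limit_value, int):
        raise TypeError("the limit expression must evaluate to an integer")
    if limit_value < 0:
        raise ValueError("the limit expression must be non-negative")
    return limit_value
```

A query like `LIMIT CAST(NULL AS INT)` would then be rejected at analysis time with the null-limit message rather than reaching the planner.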
[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563730#comment-16563730 ] Saisai Shao edited comment on SPARK-24615 at 7/31/18 2:16 PM: -- Hi [~tgraves], I think eval() might unnecessarily break the lineage which could execute in one stage, for example: data transforming -> training -> transforming, this could possibly run in one stage, using eval will break into several stages, I'm not sure if it is the good choice. Also if we use eval to break the lineage, how do we store the intermediate data, like shuffle, or in memory/ on disk? Yes, how to break the boundaries is hard for user to know, but currently I cannot figure out a good solution, unless we use eval() to explicitly separate them. To solve the conflicts, failing might be one choice. In the SQL or DF area, I don't think we have to expose such low level RDD APIs to user, maybe some hints should be enough (though I haven't thought about it). Currently in my design, withResources only applies to the stage in which RDD will be executed, the following stages will still be ordinary stages without additional resources. was (Author: jerryshao): Hi [~tgraves], I think eval() might unnecessarily break the lineage which could execute in one stage, for example: data transforming -> training -> transforming, this could possibly run in one stage, using eval will break into several stages, I'm not sure if it is the good choice. Also if we use eval to break the lineage, how do we store the intermediate data, like shuffle, or in memory/ on disk? Yes, how to break the boundaries is hard for user to know, but currently I cannot figure out a good solution, unless we use eval() to explicitly separate them. To solve the conflicts, failing might be one choice. Currently in my design, withResources only applies to the stage in which RDD will be executed, the following stages will still be ordinary stages without additional resouces. 
> Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Currently Spark’s scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # CPU cores are usually more numerous than accelerators on one node; using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we always assume that a CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang]
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563730#comment-16563730 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves], I think eval() might unnecessarily break the lineage which could execute in one stage, for example: data transforming -> training -> transforming could possibly run in one stage, and using eval will break it into several stages; I'm not sure if that is a good choice. Also, if we use eval to break the lineage, how do we store the intermediate data, like shuffle, or in memory / on disk? Yes, how to break the boundaries is hard for the user to know, but currently I cannot figure out a good solution, unless we use eval() to explicitly separate them. To solve the conflicts, failing might be one choice. Currently in my design, withResources only applies to the stage in which the RDD will be executed; the following stages will still be ordinary stages without additional resources.
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563711#comment-16563711 ] Thomas Graves commented on SPARK-24615: --- So I guess my question is: is this the right approach at all? Should we make it more obvious to the user where the boundaries would end by adding something like the eval()? I guess they could do like what they do for caching today, which is force an action like count() to force the eval. If we are making the user add in their own action to evaluate, I would pick failing if there are conflicting resources. That way it's completely obvious to the user that they might not be getting what they want. The only hard part then is if the user doesn't know how to resolve that because of the way spark decides to pick its stages, which might not be obvious to the users. Especially on the SQL side, but perhaps we aren't going to support that right now, or support it via hints? It's still not clear to me from the above: are we saying withResources applies to just the immediate next stage, or does it stick until they change it again?
[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563689#comment-16563689 ] Thomas Graves commented on SPARK-24579: --- Going from the "Spark feeds data into DL/AI frameworks for training" section, where we have an example using train() with tensorflow (def train(batches): # create a TensorFlow Dataset from Arrow batches (see next section)): I assume the section that matches this would be the "Spark supports Tensor data types and TensorFlow computation graphs" section? It would be nice to finish that example to be more specific about how that would work. Do you know how often the tensorflow api has changed and broken, or are there different versions with different apis? It definitely seems safer to have this as some sort of plugin, but I haven't looked at the specifics. Do we have examples of how this would work with frameworks other than TensorFlow? Sorry if I missed it, but I haven't seen whether we plan on supporting specific ones other than TensorFlow. > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. 
> We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead.
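The core idea of the SPIP, one standardized columnar exchange format instead of a bespoke converter per framework, can be illustrated with a framework-free sketch. Plain Python here; Arrow itself is not used, and the function names are illustrative, not from the SPIP.

```python
def rows_to_batch(rows, schema):
    """Pivot row tuples into a dict of column lists: a stand-in for a
    standardized columnar batch that any consumer could read without
    a per-framework converter."""
    return {name: [row[i] for row in rows] for i, name in enumerate(schema)}

def batch_to_rows(batch, schema):
    """Inverse conversion back to row tuples, for the consumer side."""
    return list(zip(*(batch[name] for name in schema)))

# Any framework that understands the batch shape can consume it directly,
# which is the "optimize once, share everywhere" argument in the SPIP.
batch = rows_to_batch([(1, "a"), (2, "b")], ["id", "label"])
```

In the real proposal the batch format is Arrow record batches, which both Spark and DL frameworks can read zero-copy; this toy only mirrors the interface idea.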
[jira] [Resolved] (SPARK-24816) SQL interface support repartitionByRange
[ https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-24816. - Resolution: Won't Fix {{ORDER BY}} is implemented by {{RangePartitioning}}. > SQL interface support repartitionByRange > > > Key: SPARK-24816 > URL: https://issues.apache.org/jira/browse/SPARK-24816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: DISTRIBUTE_BY_SORT_BY.png, > RANGE_DISTRIBUTE_BY_SORT_BY.png > > > SQL interface support for {{repartitionByRange}} to improve data pushdown. I > have tested this feature with a big table (data size: 1.1 T, row count: > 282,001,954,428). > The test sql is: > {code:sql} > select * from table where id=401564838907 > {code} > The test result: > |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation > MB-seconds| > |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086| > |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846| > |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620| > |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774| > |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297| > |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
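For context on why range partitioning prunes input so effectively in the table above: it samples the keys, derives split bounds, and routes each row by binary search, so every output partition holds a contiguous key range and a point lookup touches only one partition. This is a simplified illustration of the idea, not Spark's actual RangePartitioner.

```python
import bisect

def range_bounds(sample_keys, num_partitions):
    """Derive (num_partitions - 1) split points from a key sample."""
    keys = sorted(sample_keys)
    step = len(keys) / num_partitions
    return [keys[int(step * i)] for i in range(1, num_partitions)]

def partition_of(key, bounds):
    """Route a key to its partition via binary search on the bounds."""
    return bisect.bisect_right(bounds, key)
```

Because each partition covers one key range, a filter like `id = 401564838907` only needs the single partition whose range contains that id, which is the input-size drop (959.2 GB to 38.5 GB) measured in the ticket.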
[jira] [Commented] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
[ https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563487#comment-16563487 ] Apache Spark commented on SPARK-24978: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/21931 > Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity > of fast aggregation. > - > > Key: SPARK-24978 > URL: https://issues.apache.org/jira/browse/SPARK-24978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: caoxuewen >Priority: Major > > This PR adds a configuration parameter to configure the capacity of fast > aggregation. > Performance comparison: > /* > Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 > Intel64 Family 6 Model 94 Stepping 3, GenuineIntel > Aggregate w multiple keys: Best/Avg Time(ms) Rate(M/s) > Per Row(ns) Relative > > > fasthash = default 5612 / 5882 3.7 > 267.6 1.0X > fasthash = config 3586 / 3595 5.8 > 171.0 1.6X > */
[jira] [Assigned] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
[ https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24978: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
[ https://issues.apache.org/jira/browse/SPARK-24978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24978: Assignee: Apache Spark
[jira] [Created] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
caoxuewen created SPARK-24978: - Summary: Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation. Key: SPARK-24978 URL: https://issues.apache.org/jira/browse/SPARK-24978 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: caoxuewen This PR adds a configuration parameter to configure the capacity of fast aggregation. Performance comparison: /* Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 Intel64 Family 6 Model 94 Stepping 3, GenuineIntel Aggregate w multiple keys: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative fasthash = default 5612 / 5882 3.7 267.6 1.0X fasthash = config 3586 / 3595 5.8 171.0 1.6X */
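The knob proposed here sizes a fast, fixed-capacity in-memory hash map; once that map is full, new grouping keys take a slower fallback path, so a larger capacity keeps more keys on the fast path. A toy sketch of that capacity/fallback behavior (plain Python, not Spark's hash-aggregate code; the fallback dict stands in for the slower sort-based or row-based path):

```python
def aggregate_counts(keys, fast_capacity):
    """Count keys using a capacity-bounded fast map plus a fallback map."""
    fast, fallback = {}, {}
    for k in keys:
        if k in fast:
            fast[k] += 1                 # fast path: key already resident
        elif len(fast) < fast_capacity:
            fast[k] = 1                  # fast path: room for a new key
        else:
            # slow path: in a real engine this is the expensive fallback
            fallback[k] = fallback.get(k, 0) + 1
    merged = dict(fast)
    for k, v in fallback.items():
        merged[k] = merged.get(k, 0) + v
    return merged
```

The results are identical either way; only how much work lands on the slow path changes, which is what the 1.0X vs 1.6X benchmark above is measuring.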
[jira] [Resolved] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream
[ https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent resolved SPARK-24968. - Resolution: Fixed > Configurable Chunksize in ChunkedByteBufferOutputStream > --- > > Key: SPARK-24968 > URL: https://issues.apache.org/jira/browse/SPARK-24968 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2, 2.3.0, 2.4.0 >Reporter: Vincent >Priority: Minor > > Hello, > it seems that when creating a _ChunkedByteBufferOutputStream,_ the chunk size > is always configured to be 4MB. I suggest we make it configurable via spark > conf. This would allow to solve issues like SPARK-24917 (by increasing the > chunk size to a bigger value in the conf). > What do you think ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream
[ https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563467#comment-16563467 ] Vincent commented on SPARK-24968: - Indeed, they are closely related; I'll close this ticket. > Configurable Chunksize in ChunkedByteBufferOutputStream > --- > > Key: SPARK-24968 > URL: https://issues.apache.org/jira/browse/SPARK-24968 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2, 2.3.0, 2.4.0 >Reporter: Vincent >Priority: Minor > > Hello, > it seems that when creating a _ChunkedByteBufferOutputStream,_ the chunk size > is always configured to be 4MB. I suggest we make it configurable via spark > conf. This would allow to solve issues like SPARK-24917 (by increasing the > chunk size to a bigger value in the conf). > What do you think ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563455#comment-16563455 ] Paul Westenthanner commented on SPARK-24946: Sure, go ahead :) > PySpark - Allow np.Arrays and pd.Series in df.approxQuantile > > > Key: SPARK-24946 > URL: https://issues.apache.org/jira/browse/SPARK-24946 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Paul Westenthanner >Priority: Minor > Labels: DataFrame, beginner, pyspark > > As Python user it is convenient to pass a numpy array or pandas series > `{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` for the > probabilities parameter. > > Especially for creating cumulative plots (say in 1% steps) it is handy to use > `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
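Until support for array-likes lands, the coercion the reporter wants can be done on the caller's side. A minimal sketch in plain Python: `approxQuantile` currently expects a plain list of floats, so any iterable of probabilities (an `np.arange`, a `pd.Series`, a generator) can be converted first. The `df.approxQuantile` call is left commented out because it assumes a live SparkSession and DataFrame.

```python
def as_probability_list(probabilities):
    """Coerce an iterable of probabilities (e.g. np.arange, pd.Series)
    into the list[float] that DataFrame.approxQuantile accepts today."""
    probs = [float(p) for p in probabilities]
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")
    return probs

# 1% steps, equivalent to np.arange(0, 1.0, 0.01):
percent_steps = as_probability_list(i / 100 for i in range(100))

# With a real DataFrame `df` this would be:
# quantiles = df.approxQuantile("col", percent_steps, 0.01)
```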
[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563434#comment-16563434 ] Apache Spark commented on SPARK-14540: -- User 'skonto' has created a pull request for this issue: https://github.com/apache/spark/pull/21930 > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Josh Rosen >Priority: Major > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown. (ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) > [info]- 
element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info]- field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. > (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24977) input_file_name() result can't save and use for partitionBy()
[ https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563424#comment-16563424 ] Srinivasarao Padala commented on SPARK-24977: - I can see the filenames via df.show(), but after saving to a file the column is empty and cannot be used for partitionBy() while saving. Error: java.lang.AssertionError: assertion failed: Empty partition column value in 'filename=' > input_file_name() result can't save and use for partitionBy() > - > > Key: SPARK-24977 > URL: https://issues.apache.org/jira/browse/SPARK-24977 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: Srinivasarao Padala >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24977) input_file_name() result can't save and use for paritionBy()
Srinivasarao Padala created SPARK-24977: --- Summary: input_file_name() result can't save and use for paritionBy() Key: SPARK-24977 URL: https://issues.apache.org/jira/browse/SPARK-24977 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, SQL Affects Versions: 2.3.1 Reporter: Srinivasarao Padala -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24977) input_file_name() result can't save and use for partitionBy()
[ https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivasarao Padala updated SPARK-24977: Summary: input_file_name() result can't save and use for partitionBy() (was: input_file_name() result can't save and use for paritionBy()) > input_file_name() result can't save and use for partitionBy() > - > > Key: SPARK-24977 > URL: https://issues.apache.org/jira/browse/SPARK-24977 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.3.1 >Reporter: Srinivasarao Padala >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24968) Configurable Chunksize in ChunkedByteBufferOutputStream
[ https://issues.apache.org/jira/browse/SPARK-24968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563274#comment-16563274 ] Hyukjin Kwon commented on SPARK-24968: -- So is it a duplicate of SPARK-24917 roughly? I suggest to leave this resolved if that or this issue is a subset > Configurable Chunksize in ChunkedByteBufferOutputStream > --- > > Key: SPARK-24968 > URL: https://issues.apache.org/jira/browse/SPARK-24968 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2, 2.3.0, 2.4.0 >Reporter: Vincent >Priority: Minor > > Hello, > it seems that when creating a _ChunkedByteBufferOutputStream,_ the chunk size > is always configured to be 4MB. I suggest we make it configurable via spark > conf. This would allow to solve issues like SPARK-24917 (by increasing the > chunk size to a bigger value in the conf). > What do you think ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24949) pyspark.sql.Column breaks the iterable contract
[ https://issues.apache.org/jira/browse/SPARK-24949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24949. -- Resolution: Won't Fix Let me leave this resolved. Seems difficult to fix but I wonder if it's worth. Please reopen and go ahead if a PR is ready. Let me leave this resolve till then. > pyspark.sql.Column breaks the iterable contract > --- > > Key: SPARK-24949 > URL: https://issues.apache.org/jira/browse/SPARK-24949 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Daniel Shields >Priority: Minor > > pyspark.sql.Column implements __iter__ just to raise a TypeError: > {code:java} > def __iter__(self): > raise TypeError("Column is not iterable") > {code} > This makes column look iterable even when it isn't: > {code:java} > isinstance(mycolumn, collections.Iterable) # Evaluates to True{code} > This function should be removed from Column completely so it behaves like > every other non-iterable class. > For further motivation of why this should be fixed, consider the below > example, which currently requires listing Column explicitly: > {code:java} > def listlike(value): > # Column unfortunately implements __iter__ just to raise a TypeError. > # This breaks the iterable contract and should be fixed in Spark proper. > return isinstance(value, collections.Iterable) and not isinstance(value, > (str, bytes, pyspark.sql.Column)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
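The broken contract described in SPARK-24949 can be reproduced without Spark at all. `FakeColumn` below is a minimal stand-in for `pyspark.sql.Column`: merely defining `__iter__`, even one that only raises, is enough for `isinstance(..., Iterable)` to report True, which is why the reporter's `listlike` helper has to special-case the class explicitly.

```python
import collections.abc

class FakeColumn:
    """Stand-in for pyspark.sql.Column: defines __iter__ only to raise."""
    def __iter__(self):
        raise TypeError("Column is not iterable")

col = FakeColumn()

# Looks iterable by duck-typing, even though iterating always fails:
print(isinstance(col, collections.abc.Iterable))  # True

# The guard from the issue, with FakeColumn standing in for Column:
def listlike(value):
    return isinstance(value, collections.abc.Iterable) and not isinstance(
        value, (str, bytes, FakeColumn)
    )
```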
[jira] [Commented] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException
[ https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563267#comment-16563267 ] Apache Spark commented on SPARK-24970: -- User 'brucezhao11' has created a pull request for this issue: https://github.com/apache/spark/pull/21929 > Spark Kinesis streaming application fails to recover from streaming > checkpoint due to ProvisionedThroughputExceededException > > > Key: SPARK-24970 > URL: https://issues.apache.org/jira/browse/SPARK-24970 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: bruce_zhao >Priority: Major > Labels: kinesis > > > We're using Spark streaming to consume Kinesis data, and found that it reads > more data from Kinesis and is easy to touch > ProvisionedThroughputExceededException *when it recovers from streaming > checkpoint*. > > Normally, it's a WARN in spark log. But when we have multiple streaming > applications (i.e., 5 applications) to consume the same Kinesis stream, the > situation becomes serious. *The application will fail to recover due to the > following exception in driver.* And one application failure will also affect > the other running applications. 
> > {panel:title=Exception} > org.apache.spark.SparkException: Job aborted due to stage failure: > {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent > failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, > ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): > org.apache.spark.SparkException: Gave up after 3 retries while getting > records using shard iterator, last exception: at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at > org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) > Caused by: > *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: > Rate exceeded for shard shardId- 
in stream rellfsstream-an under > account 1{color}.* (Service: AmazonKinesis; Status Code: 400; > Error Code: ProvisionedThroughputExceededException; Request ID: > d3520677-060e-14c4-8014-2886b6b75f03) at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647) > at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at > com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004) > at >
[jira] [Assigned] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException
[ https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24970: Assignee: (was: Apache Spark) > Spark Kinesis streaming application fails to recover from streaming > checkpoint due to ProvisionedThroughputExceededException > > > Key: SPARK-24970 > URL: https://issues.apache.org/jira/browse/SPARK-24970 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: bruce_zhao >Priority: Major > Labels: kinesis > > > We're using Spark streaming to consume Kinesis data, and found that it reads > more data from Kinesis and is easy to touch > ProvisionedThroughputExceededException *when it recovers from streaming > checkpoint*. > > Normally, it's a WARN in spark log. But when we have multiple streaming > applications (i.e., 5 applications) to consume the same Kinesis stream, the > situation becomes serious. *The application will fail to recover due to the > following exception in driver.* And one application failure will also affect > the other running applications. 
> > {panel:title=Exception} > org.apache.spark.SparkException: Job aborted due to stage failure: > {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent > failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, > ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): > org.apache.spark.SparkException: Gave up after 3 retries while getting > records using shard iterator, last exception: at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at > org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) > Caused by: > *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: > Rate exceeded for shard shardId- 
in stream rellfsstream-an under > account 1{color}.* (Service: AmazonKinesis; Status Code: 400; > Error Code: ProvisionedThroughputExceededException; Request ID: > d3520677-060e-14c4-8014-2886b6b75f03) at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647) > at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at > com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980) > at >
[jira] [Assigned] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException
[ https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24970: Assignee: Apache Spark > Spark Kinesis streaming application fails to recover from streaming > checkpoint due to ProvisionedThroughputExceededException > > > Key: SPARK-24970 > URL: https://issues.apache.org/jira/browse/SPARK-24970 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: bruce_zhao >Assignee: Apache Spark >Priority: Major > Labels: kinesis > > > We're using Spark streaming to consume Kinesis data, and found that it reads > more data from Kinesis and is easy to touch > ProvisionedThroughputExceededException *when it recovers from streaming > checkpoint*. > > Normally, it's a WARN in spark log. But when we have multiple streaming > applications (i.e., 5 applications) to consume the same Kinesis stream, the > situation becomes serious. *The application will fail to recover due to the > following exception in driver.* And one application failure will also affect > the other running applications. 
> > {panel:title=Exception} > org.apache.spark.SparkException: Job aborted due to stage failure: > {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent > failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, > ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): > org.apache.spark.SparkException: Gave up after 3 retries while getting > records using shard iterator, last exception: at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) > at > org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at > org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) > Caused by: > *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: > Rate exceeded for shard shardId- 
in stream rellfsstream-an under > account 1{color}.* (Service: AmazonKinesis; Status Code: 400; > Error Code: ProvisionedThroughputExceededException; Request ID: > d3520677-060e-14c4-8014-2886b6b75f03) at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647) > at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at > com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004) > at > com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980) >
[jira] [Resolved] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404
[ https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24975. - Resolution: Duplicate > Spark history server REST API /api/v1/version returns error 404 > --- > > Key: SPARK-24975 > URL: https://issues.apache.org/jira/browse/SPARK-24975 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: shanyu zhao >Priority: Major > > Spark history server REST API provides /api/v1/version, according to doc: > [https://spark.apache.org/docs/latest/monitoring.html] > However, for Spark 2.3, we see: > {code:java} > curl http://localhost:18080/api/v1/version > > > > Error 404 Not Found > > HTTP ERROR 404 > Problem accessing /api/v1/version. Reason: > Not Foundhttp://eclipse.org/jetty;>Powered by > Jetty:// 9.3.z-SNAPSHOT > > {code} > On a Spark 2.2 cluster, we see: > {code:java} > curl http://localhost:18080/api/v1/version > { > "spark" : "2.2.0" > }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404
[ https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563259#comment-16563259 ] Marco Gaido commented on SPARK-24975: - This seems to be a duplicate of SPARK-24188, although here 2.3.1 is listed as affected, which should not be the case according to SPARK-24188. Could you please check whether 2.3.1 is actually affected and, if not, close this as a duplicate? Thanks. > Spark history server REST API /api/v1/version returns error 404 > --- > > Key: SPARK-24975 > URL: https://issues.apache.org/jira/browse/SPARK-24975 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: shanyu zhao >Priority: Major > > Spark history server REST API provides /api/v1/version, according to doc: > [https://spark.apache.org/docs/latest/monitoring.html] > However, for Spark 2.3, we see: > {code:java} > curl http://localhost:18080/api/v1/version > > > > Error 404 Not Found > > HTTP ERROR 404 > Problem accessing /api/v1/version. Reason: > Not Foundhttp://eclipse.org/jetty;>Powered by > Jetty:// 9.3.z-SNAPSHOT > > {code} > On a Spark 2.2 cluster, we see: > {code:java} > curl http://localhost:18080/api/v1/version > { > "spark" : "2.2.0" > }{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true
[ https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563256#comment-16563256 ] Quentin Ambard commented on SPARK-24720: Maybe we could improve that. As we already have to iterate over the records to count the number of entries for the inputInfoTracker, we could do a single iteration where we get at the same time the count and the last offset. > kafka transaction creates Non-consecutive Offsets (due to transaction offset) > making streaming fail when failOnDataLoss=true > > > Key: SPARK-24720 > URL: https://issues.apache.org/jira/browse/SPARK-24720 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Priority: Major > > When kafka transactions are used, sending 1 message to kafka will result to 1 > offset for the data + 1 offset to mark the transaction. > When kafka connector for spark streaming read a topic with non-consecutive > offset, it leads to a failure. SPARK-17147 fixed this issue for compacted > topics. > However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 > message + 1 transaction commit are in a partition, spark will try to read > offsets [0 2[. offset 0 (containing the message) will be read, but offset 1 > won't return a value and buffer.hasNext() will be false even after a poll > since no data are present for offset 1 (it's the transaction commit) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
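The single-pass idea from the comment above can be sketched independently of Kafka: while counting records for the inputInfoTracker, also remember the last offset seen, so a gap left by a transaction-commit marker can be detected by comparing the count against the requested offset range instead of failing on a missing record. The `(offset, value)` pairs here are hypothetical stand-ins for Kafka `ConsumerRecord`s.

```python
def count_and_last_offset(records):
    """Single pass over (offset, value) pairs: returns (count, last_offset).
    last_offset is None when the iterable is empty."""
    count = 0
    last_offset = None
    for offset, _value in records:
        count += 1
        last_offset = offset
    return count, last_offset

# A transactional partition asked for the range [0, 2): offset 0 holds the
# message, offset 1 is the commit marker and yields no record.
count, last = count_and_last_offset([(0, "msg")])
# count (1) < end - start (2) reveals the non-consecutive offsets.
```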
[jira] [Resolved] (SPARK-24972) PivotFirst could not handle pivot columns of complex types
[ https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24972. - Resolution: Fixed Assignee: Maryann Xue Fix Version/s: 2.4.0 > PivotFirst could not handle pivot columns of complex types > -- > > Key: SPARK-24972 > URL: https://issues.apache.org/jira/browse/SPARK-24972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > {{PivotFirst}} did not handle complex types for pivot columns properly. And > as a result, the pivot column could not be matched with any pivot value and > it always returned empty result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org