[jira] [Resolved] (SPARK-24972) PivotFirst could not handle pivot columns of complex types
[ https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24972. - Resolution: Fixed Assignee: Maryann Xue Fix Version/s: 2.4.0 > PivotFirst could not handle pivot columns of complex types > -- > > Key: SPARK-24972 > URL: https://issues.apache.org/jira/browse/SPARK-24972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > {{PivotFirst}} did not handle complex types for pivot columns properly. And > as a result, the pivot column could not be matched with any pivot value and > it always returned empty result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24976) Allow None for Decimal type conversion (specific to PyArrow 0.9.0)
[ https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24976: - Summary: Allow None for Decimal type conversion (specific to PyArrow 0.9.0) (was: Allow None for Decimal type conversion (specific to Arrow 0.9.0)) > Allow None for Decimal type conversion (specific to PyArrow 0.9.0) > -- > > Key: SPARK-24976 > URL: https://issues.apache.org/jira/browse/SPARK-24976 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://jira.apache.org/jira/browse/ARROW-2432 > If we use Arrow 0.9.0, the the test case (None as decimal) failed as below: > {code} > Traceback (most recent call last): > File "/.../spark/python/pyspark/sql/tests.py", line 4672, in > test_vectorized_udf_null_decimal > self.assertEquals(df.collect(), res.collect()) > File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect > sock_info = self._jdf.collectToPython() > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o51.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 > in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/.../spark/python/pyspark/worker.py", line 320, in main > process() > File "/.../spark/python/pyspark/worker.py", line 315, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream > batch = _create_batch(series, self._timezone) > File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch > arrs = [create_array(s, t) for s, t in series] > File "/.../spark/python/pyspark/serializers.py", line 241, in create_array > return pa.Array.from_pandas(s, mask=mask, type=t) > File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
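For reference, a minimal PySpark reproduction sketch of the failure quoted above. The session setup, column name, and decimal precision are illustrative assumptions rather than the ticket's own test; the point is only that a decimal column containing None reaches pa.Array.from_pandas without a null mask under PyArrow 0.9.0.

{code:python}
# Reproduction sketch (assumed names/precision); with PyArrow 0.9.0 and Spark
# before the fix, collecting the result reportedly fails with
# "ArrowInvalid: ... Got Python object of type NoneType ...".
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.master("local[*]").appName("decimal-none").getOrCreate()

schema = StructType([StructField("v", DecimalType(38, 18), True)])
df = spark.createDataFrame([(Decimal("1.0"),), (None,)], schema)

# Identity pandas_udf over the decimal column; the None row is what trips the
# Arrow conversion in _create_batch/create_array shown in the traceback above.
identity = pandas_udf(lambda s: s, DecimalType(38, 18))
df.select(identity(col("v"))).show()
{code}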
[jira] [Assigned] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)
[ https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24976: Assignee: Apache Spark > Allow None for Decimal type conversion (specific to Arrow 0.9.0) > > > Key: SPARK-24976 > URL: https://issues.apache.org/jira/browse/SPARK-24976 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > See https://jira.apache.org/jira/browse/ARROW-2432 > If we use Arrow 0.9.0, the the test case (None as decimal) failed as below: > {code} > Traceback (most recent call last): > File "/.../spark/python/pyspark/sql/tests.py", line 4672, in > test_vectorized_udf_null_decimal > self.assertEquals(df.collect(), res.collect()) > File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect > sock_info = self._jdf.collectToPython() > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o51.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 > in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/.../spark/python/pyspark/worker.py", line 320, in main > process() > File "/.../spark/python/pyspark/worker.py", line 315, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream > batch = _create_batch(series, self._timezone) > File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch > arrs = [create_array(s, t) for s, t in series] > File "/.../spark/python/pyspark/serializers.py", line 241, in create_array > return pa.Array.from_pandas(s, mask=mask, type=t) > File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)
[ https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24976: Assignee: (was: Apache Spark) > Allow None for Decimal type conversion (specific to Arrow 0.9.0) > > > Key: SPARK-24976 > URL: https://issues.apache.org/jira/browse/SPARK-24976 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://jira.apache.org/jira/browse/ARROW-2432 > If we use Arrow 0.9.0, the the test case (None as decimal) failed as below: > {code} > Traceback (most recent call last): > File "/.../spark/python/pyspark/sql/tests.py", line 4672, in > test_vectorized_udf_null_decimal > self.assertEquals(df.collect(), res.collect()) > File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect > sock_info = self._jdf.collectToPython() > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o51.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 > in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/.../spark/python/pyspark/worker.py", line 320, in main > process() > File "/.../spark/python/pyspark/worker.py", line 315, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream > batch = _create_batch(series, self._timezone) > File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch > arrs = [create_array(s, t) for s, t in series] > File "/.../spark/python/pyspark/serializers.py", line 241, in create_array > return pa.Array.from_pandas(s, mask=mask, type=t) > File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)
[ https://issues.apache.org/jira/browse/SPARK-24976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563117#comment-16563117 ] Apache Spark commented on SPARK-24976: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21928 > Allow None for Decimal type conversion (specific to Arrow 0.9.0) > > > Key: SPARK-24976 > URL: https://issues.apache.org/jira/browse/SPARK-24976 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://jira.apache.org/jira/browse/ARROW-2432 > If we use Arrow 0.9.0, the the test case (None as decimal) failed as below: > {code} > Traceback (most recent call last): > File "/.../spark/python/pyspark/sql/tests.py", line 4672, in > test_vectorized_udf_null_decimal > self.assertEquals(df.collect(), res.collect()) > File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect > sock_info = self._jdf.collectToPython() > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o51.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 > in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/.../spark/python/pyspark/worker.py", line 320, in main > process() > File "/.../spark/python/pyspark/worker.py", line 315, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream > batch = _create_batch(series, self._timezone) > File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch > arrs = [create_array(s, t) for s, t in series] > File "/.../spark/python/pyspark/serializers.py", line 241, in create_array > return pa.Array.from_pandas(s, mask=mask, type=t) > File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24930: -- Due Date: (was: 26/Jul/18) Affects Version/s: (was: 2.2.1) (was: 2.3.0) Priority: Minor (was: Major) > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Minor > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can confuse user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
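A small illustrative sketch of the distinction the reported error message glosses over. This is plain Python, not the actual LoadDataCommand logic; the path is the one from the report, and the helper name is hypothetical.

{code:python}
import os

def describe_local_path(path):
    """Illustrative only: separate 'really missing' from 'parent directory not
    traversable by the current user', which is what happens to the mr user for
    /root/test.txt and why the 'does not exist' message misleads."""
    parent = os.path.dirname(path) or "."
    if not os.access(parent, os.X_OK):
        return "parent directory %s is not accessible (permission denied)" % parent
    if not os.path.exists(path):
        return "input path does not exist: %s" % path
    if not os.access(path, os.R_OK):
        return "input path exists but is not readable: %s" % path
    return "input path is readable: %s" % path

print(describe_local_path("/root/test.txt"))
{code}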
[jira] [Created] (SPARK-24976) Allow None for Decimal type conversion (specific to Arrow 0.9.0)
Hyukjin Kwon created SPARK-24976: Summary: Allow None for Decimal type conversion (specific to Arrow 0.9.0) Key: SPARK-24976 URL: https://issues.apache.org/jira/browse/SPARK-24976 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.4.0 Reporter: Hyukjin Kwon See https://jira.apache.org/jira/browse/ARROW-2432 If we use Arrow 0.9.0, the the test case (None as decimal) failed as below: {code} Traceback (most recent call last): File "/.../spark/python/pyspark/sql/tests.py", line 4672, in test_vectorized_udf_null_decimal self.assertEquals(df.collect(), res.collect()) File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect sock_info = self._jdf.collectToPython() File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o51.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/pyspark/worker.py", line 320, in main process() File "/.../spark/python/pyspark/worker.py", line 315, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream batch = _create_batch(series, self._timezone) File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch arrs = [create_array(s, t) for s, t in series] File "/.../spark/python/pyspark/serializers.py", line 241, in create_array return pa.Array.from_pandas(s, mask=mask, type=t) File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas File "array.pxi", line 177, in pyarrow.lib.array File "error.pxi", line 77, in pyarrow.lib.check_status File "error.pxi", line 77, in pyarrow.lib.check_status ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23334) Fix pandas_udf with return type StringType() to handle str type properly in Python 2.
[ https://issues.apache.org/jira/browse/SPARK-23334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23334: - Issue Type: Sub-task (was: Bug) Parent: SPARK-22216 > Fix pandas_udf with return type StringType() to handle str type properly in > Python 2. > - > > Key: SPARK-23334 > URL: https://issues.apache.org/jira/browse/SPARK-23334 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Blocker > Fix For: 2.3.0 > > > In Python 2, when pandas_udf tries to return string type value created in the > udf with {{".."}}, the execution fails. E.g., > {code:java} > from pyspark.sql.functions import pandas_udf, col > import pandas as pd > df = spark.range(10) > str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string") > df.select(str_f(col('id'))).show() > {code} > raises the following exception: > {code} > ... > java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: > expected StringType, got BinaryType > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:93) > ... > {code} > Seems like pyarrow ignores {{type}} parameter for {{pa.Array.from_pandas()}} > and consider it as binary type when the type is string type and the string > values are {{str}} instead of {{unicode}} in Python 2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
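As a hedged workaround sketch for Python 2 (not part of the ticket's fix, and assuming the same active `spark` session as the snippet above): returning unicode objects from the UDF avoids the binary-type inference described, since the problem is specific to plain str values.

{code:python}
# Same example as above, but the Series holds unicode (u"...") instead of str,
# so pyarrow infers a string array rather than a binary one under Python 2.
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series([u"%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
{code}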
[jira] [Assigned] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24820: Assignee: (was: Apache Spark) > Fail fast when submitted job contains PartitionPruningRDD in a barrier stage > > > Key: SPARK-24820 > URL: https://issues.apache.org/jira/browse/SPARK-24820 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Detect when SparkContext.runJob() launches a barrier stage that includes a > PartitionPruningRDD. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563109#comment-16563109 ] Apache Spark commented on SPARK-24820: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/21927 > Fail fast when submitted job contains PartitionPruningRDD in a barrier stage > > > Key: SPARK-24820 > URL: https://issues.apache.org/jira/browse/SPARK-24820 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Detect when SparkContext.runJob() launches a barrier stage that includes a > PartitionPruningRDD. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24820: Assignee: Apache Spark > Fail fast when submitted job contains PartitionPruningRDD in a barrier stage > > > Key: SPARK-24820 > URL: https://issues.apache.org/jira/browse/SPARK-24820 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Major > > Detect when SparkContext.runJob() launches a barrier stage that includes a > PartitionPruningRDD. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563052#comment-16563052 ] Hyukjin Kwon commented on SPARK-24946: -- Hmmm. Mind if I ask for a discussion on the dev mailing list after Spark 2.4.0 is released (since committers are busy with the release currently)? This one specific case may be trivial, but I am worried about whether we should consider allowing all other cases. > PySpark - Allow np.Arrays and pd.Series in df.approxQuantile > > > Key: SPARK-24946 > URL: https://issues.apache.org/jira/browse/SPARK-24946 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Paul Westenthanner >Priority: Minor > Labels: DataFrame, beginner, pyspark > > As a Python user, it is convenient to pass a numpy array or pandas series to > `{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` for the > probabilities parameter. > > Especially for creating cumulative plots (say in 1% steps) it is handy to use > `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
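Until such an API change lands, a workable pattern is to convert the numpy array to a plain Python list of floats, which the current signature already accepts. The DataFrame and column name below are assumptions for illustration.

{code:python}
import numpy as np

# Current-API workaround: approxQuantile accepts a list of floats, so convert
# the numpy array before the call (column name "value" is assumed).
probabilities = [float(p) for p in np.arange(0.0, 1.0, 0.01)]
quantiles = df.approxQuantile("value", probabilities, 0.01)
print(quantiles[:5])
{code}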
[jira] [Created] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404
shanyu zhao created SPARK-24975: --- Summary: Spark history server REST API /api/v1/version returns error 404 Key: SPARK-24975 URL: https://issues.apache.org/jira/browse/SPARK-24975 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1, 2.3.0 Reporter: shanyu zhao Spark history server REST API provides /api/v1/version, according to doc: [https://spark.apache.org/docs/latest/monitoring.html] However, for Spark 2.3, we see: {code:java} curl http://localhost:18080/api/v1/version Error 404 Not Found HTTP ERROR 404 Problem accessing /api/v1/version. Reason: Not Found Powered by Jetty:// 9.3.z-SNAPSHOT {code} On a Spark 2.2 cluster, we see: {code:java} curl http://localhost:18080/api/v1/version { "spark" : "2.2.0" }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
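A quick sanity-check sketch: the applications endpoint on the same API root still answers, which suggests the 404 is specific to the /version route rather than the REST API being disabled. This assumes the `requests` package and a history server on the default port.

{code:python}
import requests

base = "http://localhost:18080/api/v1"
# /applications is served, while /version returns 404 on the affected 2.3 builds.
for endpoint in ("/applications", "/version"):
    resp = requests.get(base + endpoint)
    print(endpoint, resp.status_code)
{code}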
[jira] [Resolved] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23633. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21887 [https://github.com/apache/spark/pull/21887] > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > Fix For: 2.4.0 > > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23633: Assignee: Li Jin > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > Fix For: 2.4.0 > > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24937) Datasource partition table should load empty static partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24937: Summary: Datasource partition table should load empty static partitions (was: Datasource partition table should load empty partitions) > Datasource partition table should load empty static partitions > -- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Description: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 API, isolate the stateull part of the API, think of better naming of some interfaces. Details please see the attached google doc: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing was: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 API, isolate the stateull parts. Details please see the attached google doc: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateull part of the API, think of better naming > of some interfaces. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Description: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 API, isolate the stateull parts. Details please see the attached google doc: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing was: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 read API. Details please see the attached google doc: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateull parts. Details please see the attached > google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Summary: data source v2 API improvement (was: separate responsibilities of the data source v2 read API) > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24952) Support LZMA2 compression by Avro datasource
[ https://issues.apache.org/jira/browse/SPARK-24952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24952: Assignee: Maxim Gekk > Support LZMA2 compression by Avro datasource > > > Key: SPARK-24952 > URL: https://issues.apache.org/jira/browse/SPARK-24952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > LZMA2 (XZ) has a much better compression ratio compared to the currently > supported snappy and deflate codecs. The underlying Avro library supports the > compression codec already. Need to set parameters for the codec and allow > users to specify "xz" compression via AvroOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24952) Support LZMA2 compression by Avro datasource
[ https://issues.apache.org/jira/browse/SPARK-24952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24952. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21902 [https://github.com/apache/spark/pull/21902] > Support LZMA2 compression by Avro datasource > > > Key: SPARK-24952 > URL: https://issues.apache.org/jira/browse/SPARK-24952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > LZMA2 (XZ) has a much better compression ratio compared to the currently > supported snappy and deflate codecs. The underlying Avro library supports the > compression codec already. Need to set parameters for the codec and allow > users to specify "xz" compression via AvroOptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
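For context, a usage sketch of what the change enables, assuming the Spark 2.4 built-in Avro source with its per-write `compression` option and an active `spark` session; the DataFrame and output path are placeholders.

{code:python}
# Write Avro with LZMA2/XZ compression; "compression" is the write option the
# ticket adds "xz" support for (a session-level
# spark.sql.avro.compression.codec setting is the other route).
df.write.format("avro").option("compression", "xz").save("/tmp/events_xz")
{code}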
[jira] [Assigned] (SPARK-24972) PivotFirst could not handle pivot columns of complex types
[ https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24972: Assignee: Apache Spark > PivotFirst could not handle pivot columns of complex types > -- > > Key: SPARK-24972 > URL: https://issues.apache.org/jira/browse/SPARK-24972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Assignee: Apache Spark >Priority: Minor > > {{PivotFirst}} did not handle complex types for pivot columns properly. And > as a result, the pivot column could not be matched with any pivot value and > it always returned empty result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24972) PivotFirst could not handle pivot columns of complex types
[ https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562659#comment-16562659 ] Apache Spark commented on SPARK-24972: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/21926 > PivotFirst could not handle pivot columns of complex types > -- > > Key: SPARK-24972 > URL: https://issues.apache.org/jira/browse/SPARK-24972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Minor > > {{PivotFirst}} did not handle complex types for pivot columns properly. And > as a result, the pivot column could not be matched with any pivot value and > it always returned empty result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24972) PivotFirst could not handle pivot columns of complex types
[ https://issues.apache.org/jira/browse/SPARK-24972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24972: Assignee: (was: Apache Spark) > PivotFirst could not handle pivot columns of complex types > -- > > Key: SPARK-24972 > URL: https://issues.apache.org/jira/browse/SPARK-24972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Minor > > {{PivotFirst}} did not handle complex types for pivot columns properly. And > as a result, the pivot column could not be matched with any pivot value and > it always returned empty result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562638#comment-16562638 ] holdenk commented on SPARK-24579: - [~mengxr]How about you just open comments up in general and then turn it off if Spam becomes a problem? > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
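As a rough illustration of the status quo the SPIP wants to improve on (not part of the proposal itself, and assuming an active `spark` session): today the closest built-in exchange path on the Python side is Arrow-accelerated conversion to pandas, which a DL framework's input pipeline can then consume. The table name is a placeholder.

{code:python}
# Current Arrow-based hand-off from Spark to Python, for comparison with the
# standardized exchange interface the SPIP proposes.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = spark.table("features").toPandas()  # "features" is a hypothetical table
# pdf (a pandas DataFrame) can now be fed to a TensorFlow/MXNet input pipeline.
{code}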
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562632#comment-16562632 ] koert kuipers commented on SPARK-15516: --- we also ran into this on columns that are not key columns. i am not so sure this has anything to do with key columns. it seems parquet uses StructType.merge, which seems to support type conversion for array types, map types, struct types, and decimal types. for any other types it simply requires them to be the same. anyone know why? > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki >Priority: Major > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and 
[~mengxr] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
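A reproduction sketch for the non-key-column case mentioned in the comment above, with hypothetical /tmp paths and an active `spark` session: the same column written once as int and once as bigint makes mergeSchema fail with the "Failed to merge incompatible data types LongType and IntegerType" error.

{code:python}
# Two parquet directories under one root, same column "v" with different
# physical types; reading the root with mergeSchema triggers the error.
spark.range(10).selectExpr("cast(id as int) as v") \
    .write.parquet("/tmp/merge_demo/part=a")
spark.range(10).selectExpr("cast(id as bigint) as v") \
    .write.parquet("/tmp/merge_demo/part=b")

# Raises org.apache.spark.SparkException: Failed to merge incompatible data
# types LongType and IntegerType (surfaced as a Py4JJavaError in Python).
spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo").printSchema()
{code}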
[jira] [Assigned] (SPARK-24973) Add numIter to Python ClusteringSummary
[ https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24973: Assignee: Apache Spark > Add numIter to Python ClusteringSummary > > > Key: SPARK-24973 > URL: https://issues.apache.org/jira/browse/SPARK-24973 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python > version of ClusteringSummary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24973) Add numIter to Python ClusteringSummary
[ https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24973: Assignee: (was: Apache Spark) > Add numIter to Python ClusteringSummary > > > Key: SPARK-24973 > URL: https://issues.apache.org/jira/browse/SPARK-24973 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Minor > > -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python > version of ClusteringSummary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24973) Add numIter to Python ClusteringSummary
[ https://issues.apache.org/jira/browse/SPARK-24973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562624#comment-16562624 ] Apache Spark commented on SPARK-24973: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21925 > Add numIter to Python ClusteringSummary > > > Key: SPARK-24973 > URL: https://issues.apache.org/jira/browse/SPARK-24973 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Minor > > -SPARK-23528- added numIter to ClusteringSummary. Will add numIter to Python > version of ClusteringSummary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
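A sketch of the proposed usage, assuming the Python summary will simply mirror the Scala ClusteringSummary field added in SPARK-23528 and an active `spark` session; everything except `numIter` already works on Spark 2.3+.

{code:python}
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

model = KMeans(k=2, seed=1).fit(df)
summary = model.summary
print(summary.clusterSizes)  # existing field
print(summary.numIter)       # field this ticket proposes to expose in Python
{code}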
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *report_date* and *type* and i have directory structure like {code:java} /custom_path/report_date=2018-07-24/type=A/file_1.parquet {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. was: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *report_date* and *type* and i have directory structure like {code:java} /custom_path/report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache has all filestatus no matter whether you specify > partition columns or not. It causes long load time for queries that use only > couple partitions because Spark loads file's paths for files from all > partitions. > I partitioned files by *report_date* and *type* and i have directory > structure like > {code:java} > /custom_path/report_date=2018-07-24/type=A/file_1.parquet > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query i need to load only files of type *A* and it is just a couple of > files. But spark load all 19K of files from all partitions into > SharedInMemoryCache which takes about 60 secs and only after that throws > unused partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. was: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache has all filestatus no matter whether you specify > partition columns or not. It causes long load time for queries that use only > couple partitions because Spark loads file's paths for files from all > partitions. > I partitioned files by *type* and i has directory structure like > {code:java} > report_date=2018-07-24/type=A/file_1 > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query i need to load only files of type *A* and it is just a couple of > files. But spark load all 19K of files from all partitions into > SharedInMemoryCache which takes about 60 secs and only after that throws > unused partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *report_date* and *type* and i have directory structure like {code:java} /custom_path/report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. was: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache has all filestatus no matter whether you specify > partition columns or not. It causes long load time for queries that use only > couple partitions because Spark loads file's paths for files from all > partitions. > I partitioned files by *report_date* and *type* and i have directory > structure like > {code:java} > /custom_path/report_date=2018-07-24/type=A/file_1 > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query i need to load only files of type *A* and it is just a couple of > files. But spark load all 19K of files from all partitions into > SharedInMemoryCache which takes about 60 secs and only after that throws > unused partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. was: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache has all filestatus no matter whether you specify > partition columns or not. It causes long load time for queries that use only > couple partitions because Spark loads file's paths for files from all > partitions. > I partitioned files by *type* and i has directory structure like > {code:java} > report_date=2018-07-24/type=A/file_1 > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query i need to load only files of type *A* and it is just a couple of > files. But spark load all 19K of files from all partitions into > SharedInMemoryCache which takes about 60 secs and only after that throws > unused partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type *A* and it is just a couple of files. But spark load all 19K of files from all partitions into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. was: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type A and it is just couple of files. But spark load all 19K of files into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache has all filestatus no matter whether you specify > partition columns or not. It causes long load time for queries that use only > couple partitions because Spark loads file's paths for files from all > partitions. > I partitioned files by *type* and i has directory structure like > {code:java} > report_date=2018-07-24/type=A/file_1 > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query i need to load only files of type *A* and it is just a couple of > files. But spark load all 19K of files from all partitions into > SharedInMemoryCache which takes about 60 secs and only after that throws > unused partitions. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by type and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type A and it is just couple of files. But spark load all 19K of files into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. was: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by type and i has directory structure like {code} {{report_date=2018-07-24/type=A/file_1}} {code} I am trying to execute {code} {{val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count}} {code} In my query i need to load only files of type A and it is just couple of files. But spark load all 19K of files into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache has all filestatus no matter whether you specify > partition columns or not. It causes long load time for queries that use only > couple partitions because Spark loads file's paths for files from all > partitions. > I partitioned files by type and i has directory structure like > {code:java} > report_date=2018-07-24/type=A/file_1 > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query i need to load only files of type A and it is just couple of > files. But spark load all 19K of files into SharedInMemoryCache which takes > about 60 secs and only after that throws unused partitions. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] andrzej.stankev...@gmail.com updated SPARK-24974: - Description: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by *type* and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type A and it is just couple of files. But spark load all 19K of files into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. was: SharedInMemoryCache has all filestatus no matter whether you specify partition columns or not. It causes long load time for queries that use only couple partitions because Spark loads file's paths for files from all partitions. I partitioned files by type and i has directory structure like {code:java} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code:java} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query i need to load only files of type A and it is just couple of files. But spark load all 19K of files into SharedInMemoryCache which takes about 60 secs and only after that throws unused partitions. > Spark put all file's paths into SharedInMemoryCache even for unused > partitions. > --- > > Key: SPARK-24974 > URL: https://issues.apache.org/jira/browse/SPARK-24974 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > SharedInMemoryCache has all filestatus no matter whether you specify > partition columns or not. It causes long load time for queries that use only > couple partitions because Spark loads file's paths for files from all > partitions. > I partitioned files by *type* and i has directory structure like > {code:java} > report_date=2018-07-24/type=A/file_1 > {code} > > I am trying to execute > {code:java} > val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( > "type == 'A'").count > {code} > > In my query i need to load only files of type A and it is just couple of > files. But spark load all 19K of files into SharedInMemoryCache which takes > about 60 secs and only after that throws unused partitions. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24974) Spark put all file's paths into SharedInMemoryCache even for unused partitions.
andrzej.stankev...@gmail.com created SPARK-24974: Summary: Spark put all file's paths into SharedInMemoryCache even for unused partitions. Key: SPARK-24974 URL: https://issues.apache.org/jira/browse/SPARK-24974 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: andrzej.stankev...@gmail.com SharedInMemoryCache holds all FileStatus entries no matter whether you specify partition columns or not. This causes long load times for queries that touch only a couple of partitions, because Spark loads the file paths for files from all partitions. I partitioned files by type and have a directory structure like {code} report_date=2018-07-24/type=A/file_1 {code} I am trying to execute {code} val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter( "type == 'A'").count {code} In my query I need to load only files of type A, which is just a couple of files. But Spark loads all 19K files into SharedInMemoryCache, which takes about 60 seconds, and only after that does it discard the unused partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
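For illustration, here is a minimal sketch of the scenario above together with a possible workaround, assuming a spark-shell session where spark is the active SparkSession; the paths are the ones from the report, and the basePath option is a standard partition-discovery setting rather than anything introduced by this ticket.
{code:scala}
// Reported behaviour: listing starts at report_date=2018-07-24, so every
// type=* subdirectory (about 19K files) is pulled into SharedInMemoryCache
// before the filter on "type" prunes anything.
val slow = spark.read
  .parquet("/custom_path/report_date=2018-07-24")
  .filter("type == 'A'")
  .count()

// Possible workaround until listing becomes partition-aware: point the reader
// at the type=A directory and keep the partition columns via basePath, so
// only that subtree is listed.
val fast = spark.read
  .option("basePath", "/custom_path")
  .parquet("/custom_path/report_date=2018-07-24/type=A")
  .count()
{code}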
[jira] [Created] (SPARK-24973) Add numIter to Python ClusteringSummary
Huaxin Gao created SPARK-24973: -- Summary: Add numIter to Python ClusteringSummary Key: SPARK-24973 URL: https://issues.apache.org/jira/browse/SPARK-24973 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Huaxin Gao SPARK-23528 added numIter to the Scala ClusteringSummary. This ticket will add numIter to the Python version of ClusteringSummary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
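For reference, a short sketch of the Scala API that SPARK-23528 already covers and that this ticket would mirror in PySpark; the training DataFrame trainingData and its "features" column are assumptions for illustration.
{code:scala}
import org.apache.spark.ml.clustering.KMeans

// Fit a clustering model; trainingData is assumed to be a DataFrame with a
// vector column named "features".
val model = new KMeans().setK(3).setFeaturesCol("features").fit(trainingData)

// ClusteringSummary.numIter (added by SPARK-23528) reports how many
// iterations actually ran; the Python ClusteringSummary would expose the same.
val summary = model.summary
println(s"k = ${summary.k}, iterations run = ${summary.numIter}")
{code}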
[jira] [Created] (SPARK-24972) PivotFirst could not handle pivot columns of complex types
Maryann Xue created SPARK-24972: --- Summary: PivotFirst could not handle pivot columns of complex types Key: SPARK-24972 URL: https://issues.apache.org/jira/browse/SPARK-24972 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Maryann Xue {{PivotFirst}} did not handle complex types for pivot columns properly. And as a result, the pivot column could not be matched with any pivot value and it always returned empty result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
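A hedged repro sketch of the behaviour described above, pivoting on an array-typed column; the data and column names are invented, and spark.implicits._ is assumed to be in scope as in spark-shell.
{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df = Seq(
  (1, Array("a", "b"), 10),
  (1, Array("c", "d"), 20),
  (2, Array("a", "b"), 30)
).toDF("id", "arrayCol", "amount")

// The pivot column has a complex type (array<string>). Before the fix,
// PivotFirst could not match the column against the collected pivot values,
// so the pivoted aggregates always came back empty.
df.groupBy("id")
  .pivot("arrayCol")
  .agg(sum("amount"))
  .show()
{code}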
[jira] [Commented] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default
[ https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562592#comment-16562592 ] Apache Spark commented on SPARK-24963: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/21924 > Integration tests will fail if they run in a namespace not being the default > > > Key: SPARK-24963 > URL: https://issues.apache.org/jira/browse/SPARK-24963 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > Fix For: 2.4.0 > > > Related discussion is here: > [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893] > If spark-rbac.yaml is used when tests are used locally, client mode tests > will fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562521#comment-16562521 ] Apache Spark commented on SPARK-24918: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/21923 > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
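Since the description above is still debating the API shape, the following is a purely hypothetical sketch of the "base class with no-op implementations" idea; none of the names (ExecutorPlugin, spark.executor.plugins) are confirmed by this ticket.
{code:scala}
// Hypothetical plugin contract: concrete no-op methods keep existing plugins
// source-compatible if more lifecycle events are added later.
trait ExecutorPlugin {
  /** Called once when the executor starts, before any task runs. */
  def init(): Unit = {}

  /** Called once when the executor shuts down. */
  def shutdown(): Unit = {}
}

// Example instrumentation plugin that could be enabled just by re-running an
// application with an extra (hypothetical) conf, e.g.
//   --conf spark.executor.plugins=com.example.MemoryProbe
class MemoryProbe extends ExecutorPlugin {
  override def init(): Unit = {
    val rt = Runtime.getRuntime
    println(s"[MemoryProbe] executor heap: max=${rt.maxMemory()} free=${rt.freeMemory()}")
  }
}
{code}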
[jira] [Assigned] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24918: Assignee: (was: Apache Spark) > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24918: Assignee: Apache Spark > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true
[ https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562464#comment-16562464 ] Quentin Ambard commented on SPARK-24720: Something is wrong with the way the consumers are closed. I'll submit a correction using the existing consumer cache instead of creating a new one. > kafka transaction creates Non-consecutive Offsets (due to transaction offset) > making streaming fail when failOnDataLoss=true > > > Key: SPARK-24720 > URL: https://issues.apache.org/jira/browse/SPARK-24720 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Priority: Major > > When kafka transactions are used, sending 1 message to kafka will result to 1 > offset for the data + 1 offset to mark the transaction. > When kafka connector for spark streaming read a topic with non-consecutive > offset, it leads to a failure. SPARK-17147 fixed this issue for compacted > topics. > However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 > message + 1 transaction commit are in a partition, spark will try to read > offsets [0 2[. offset 0 (containing the message) will be read, but offset 1 > won't return a value and buffer.hasNext() will be false even after a poll > since no data are present for offset 1 (it's the transaction commit) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
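To make the offset arithmetic described above concrete, here is a small sketch using the plain Kafka Java client from Scala; the broker address, topic name and transactional.id are placeholders.
{code:scala}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("transactional.id", "demo-tx")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
producer.beginTransaction()
producer.send(new ProducerRecord("demo-topic", "key", "value"))  // data occupies offset 0
producer.commitTransaction()                                      // commit marker occupies offset 1
producer.close()

// A streaming read of offsets [0, 2) now sees only one record: offset 1 holds
// the transaction marker, which is exactly the "missing" offset described above.
{code}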
[jira] [Commented] (SPARK-24961) sort operation causes out of memory
[ https://issues.apache.org/jira/browse/SPARK-24961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562359#comment-16562359 ] Markus Breuer commented on SPARK-24961: --- I think it is an issue and I explained why. But I also posted to stackoverflow (some days ago) and adressed mailinglist, but got no feedback on this issue. Probably there is no easy answer to this issue, but probably my example helps to reproduce it. > sort operation causes out of memory > > > Key: SPARK-24961 > URL: https://issues.apache.org/jira/browse/SPARK-24961 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1 > Environment: Java 1.8u144+ > Windows 10 > Spark 2.3.1 in local mode > -Xms4g -Xmx4g > optional: -XX:+UseParallelOldGC >Reporter: Markus Breuer >Priority: Major > > A sort operation on large rdd - which does not fit in memory - causes out of > memory exception. I made the effect reproducable by an sample, the sample > creates large object of about 2mb size. When saving result the oom occurs. I > tried several StorageLevels, but if memory is included (MEMORY_AND_DISK, > MEMORY_AND_DISK_SER, none) application runs in out of memory. Only DISK_ONLY > seems to work. > When replacing sort() with sortWithinPartitions() no StorageLevel is required > and application succeeds. > {code:java} > package de.bytefusion.examples; > import breeze.storage.Storage; > import de.bytefusion.utils.Options; > import org.apache.hadoop.io.MapFile; > import org.apache.hadoop.io.SequenceFile; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapred.SequenceFileOutputFormat; > import org.apache.spark.api.java.JavaRDD; > import org.apache.spark.api.java.JavaSparkContext; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructType; > import org.apache.spark.storage.StorageLevel; > import scala.Tuple2; > import static org.apache.spark.sql.functions.*; > import java.util.ArrayList; > import java.util.List; > import java.util.UUID; > import java.util.stream.Collectors; > import java.util.stream.IntStream; > public class Example3 { > public static void main(String... 
args) { > // create spark session > SparkSession spark = SparkSession.builder() > .appName("example1") > .master("local[4]") > .config("spark.driver.maxResultSize","1g") > .config("spark.driver.memory","512m") > .config("spark.executor.memory","512m") > .config("spark.local.dir","d:/temp/spark-tmp") > .getOrCreate(); > JavaSparkContext sc = > JavaSparkContext.fromSparkContext(spark.sparkContext()); > // base to generate huge data > List list = new ArrayList<>(); > for (int val = 1; val < 1; val++) { > int valueOf = Integer.valueOf(val); > list.add(valueOf); > } > // create simple rdd of int > JavaRDD rdd = sc.parallelize(list,200); > // use map to create large object per row > JavaRDD rowRDD = > rdd > .map(value -> > RowFactory.create(String.valueOf(value), > createLongText(UUID.randomUUID().toString(), 2 * 1024 * 1024))) > // no persist => out of memory exception on write() > // persist MEMORY_AND_DISK => out of memory exception > on write() > // persist MEMORY_AND_DISK_SER => out of memory > exception on write() > // persist(StorageLevel.DISK_ONLY()) > ; > StructType type = new StructType(); > type = type > .add("c1", DataTypes.StringType) > .add( "c2", DataTypes.StringType ); > Dataset df = spark.createDataFrame(rowRDD, type); > // works > df.show(); > df = df > .sort(col("c1").asc() ) > ; > df.explain(); > // takes a lot of time but works > df.show(); > // OutOfMemoryError: java heap space > df > .write() > .mode("overwrite") > .csv("d:/temp/my.csv"); > // OutOfMemoryError: java heap space > df > .toJavaRDD() > .mapToPair(row -> new Tuple2(new Text(row.getString(0)), new > Text( row.getString(1 > .saveAsHadoopFile("d:\\temp\\foo", Text.class, Text.class, > SequenceFileOutputFormat.class ); > } > private static String createLongText( String text, int minLength ) { > StringBuffer sb =
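The quoted example above is cut off in the archive, so here is a brief DataFrame-level sketch of the two workarounds the reporter mentions (DISK_ONLY persistence, or sortWithinPartitions instead of a global sort). It assumes a DataFrame df with very large rows, roughly the one built in the truncated Java example, and an illustrative output path.
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// 1) Global sort, but keep the oversized rows on disk instead of in memory.
df.persist(StorageLevel.DISK_ONLY)
  .sort(col("c1").asc)
  .write.mode("overwrite").csv("/tmp/sorted")

// 2) Or skip the global ordering entirely when per-partition order suffices.
df.sortWithinPartitions(col("c1").asc)
  .write.mode("overwrite").csv("/tmp/sorted-within")
{code}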
[jira] [Resolved] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default
[ https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah resolved SPARK-24963. Resolution: Fixed Fix Version/s: 2.4.0 > Integration tests will fail if they run in a namespace not being the default > > > Key: SPARK-24963 > URL: https://issues.apache.org/jira/browse/SPARK-24963 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > Fix For: 2.4.0 > > > Related discussion is here: > [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893] > If spark-rbac.yaml is used when tests are used locally, client mode tests > will fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default
[ https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562319#comment-16562319 ] Apache Spark commented on SPARK-24963: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/21900 > Integration tests will fail if they run in a namespace not being the default > > > Key: SPARK-24963 > URL: https://issues.apache.org/jira/browse/SPARK-24963 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > Related discussion is here: > [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893] > If spark-rbac.yaml is used when tests are used locally, client mode tests > will fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default
[ https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24963: Assignee: (was: Apache Spark) > Integration tests will fail if they run in a namespace not being the default > > > Key: SPARK-24963 > URL: https://issues.apache.org/jira/browse/SPARK-24963 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > Related discussion is here: > [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893] > If spark-rbac.yaml is used when tests are used locally, client mode tests > will fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default
[ https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24963: Assignee: Apache Spark > Integration tests will fail if they run in a namespace not being the default > > > Key: SPARK-24963 > URL: https://issues.apache.org/jira/browse/SPARK-24963 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark >Priority: Minor > > Related discussion is here: > [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893] > If spark-rbac.yaml is used when tests are used locally, client mode tests > will fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24971) remove SupportsDeprecatedScanRow
[ https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24971: Assignee: Wenchen Fan (was: Apache Spark) > remove SupportsDeprecatedScanRow > > > Key: SPARK-24971 > URL: https://issues.apache.org/jira/browse/SPARK-24971 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24971) remove SupportsDeprecatedScanRow
[ https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562149#comment-16562149 ] Apache Spark commented on SPARK-24971: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21921 > remove SupportsDeprecatedScanRow > > > Key: SPARK-24971 > URL: https://issues.apache.org/jira/browse/SPARK-24971 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24971) remove SupportsDeprecatedScanRow
[ https://issues.apache.org/jira/browse/SPARK-24971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24971: Assignee: Apache Spark (was: Wenchen Fan) > remove SupportsDeprecatedScanRow > > > Key: SPARK-24971 > URL: https://issues.apache.org/jira/browse/SPARK-24971 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24971) remove SupportsDeprecatedScanRow
Wenchen Fan created SPARK-24971: --- Summary: remove SupportsDeprecatedScanRow Key: SPARK-24971 URL: https://issues.apache.org/jira/browse/SPARK-24971 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4
[ https://issues.apache.org/jira/browse/SPARK-24956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562141#comment-16562141 ] Apache Spark commented on SPARK-24956: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21920 > Upgrade maven from 3.3.9 to 3.5.4 > - > > Key: SPARK-24956 > URL: https://issues.apache.org/jira/browse/SPARK-24956 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.4.0 > > > Maven 3.3.9 looks pretty old. It would be good to upgrade this to the latest. > As suggest in SPARK-24895, the current maven will see a problem with some > plugins. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24934: - Affects Version/s: 2.3.1 > Complex type and binary type in in-memory partition pruning does not work due > to missing upper/lower bounds cases > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562126#comment-16562126 ] Hyukjin Kwon commented on SPARK-24934: -- I think this has been a bug from the first place. It at least affects 2.3.1. I manually tested: {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.1 /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.spark.sql.functions import org.apache.spark.sql.functions scala> scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") df: org.apache.spark.sql.DataFrame = [arrayCol: array] scala> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b".show() ++ |arrayCol| ++ | [a, b]| ++ scala> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b".show() ++ |arrayCol| ++ ++ {code} > Complex type and binary type in in-memory partition pruning does not work due > to missing upper/lower bounds cases > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562114#comment-16562114 ] Wenchen Fan commented on SPARK-24882: - [~rdblue] I do agree that creating a catalog via reflection and then using catalog to create a `ReadSupport` instance is cleaner. But the problem is then we need to make `CatatalogSupport` a must-have for data sources instead of an optional plugin. How about we rename the old `ReadSupport` to `ReadSupportProvider` for data sources that don't have a catalog? It works like a dynamic constructor of `ReadSupport` so that Spark can create `ReadSupport` by reflection. For the builder issue, I'm ok with adding a `ScanConfigBuilder` that is mutable and can mix in the `SupportsPushdownXYZ` traits, to make `ScanConfig` immutable. I think this model is simpler: The `ScanConfigBuilder` tracks all the pushed operators, checks the current status and gives feedback to Spark about the next operator pushdown. We can design a pure builder-like pushdown API for `ScanConfigBuilder` later. We need to support more operators pushdown to evaluate the design, so it seems safer to keep the pushdown API unchanged for now. What do you think? > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562058#comment-16562058 ] Ryan Blue edited comment on SPARK-24882 at 7/30/18 4:02 PM: [~cloud_fan], when you say that "ReadSupport is created via reflection and should be very light-weight", that raises a red flag for me. As I've mentioned in the discussions on table catalog support, I think that we should instantiate the catalog through reflection, not the data source. Starting with the data source is backward and only works now because we have just one global catalog. In my catalog support PR, the catalog is what gets instantiated via reflection. In that model, the initialization you need would be in the Table instance's life-cycle and the catalog would return tables that implement ReadSupport. So, I think we should consider the impact of multiple catalog support. I think that if we go with the catalog proposal, it would make this change cleaner because we would be able to remove the extra DataSourceReader level. This would also end up working like a lot of other Table and Scan interfaces. Iceberg, for example, has a Table that you call newScan on to get a configurable TableScan. That's very similar to how this API would end up: Table implements ReadSupport, ReadSupport exposes newScanConfig. Getting a Reader that doesn't actually read data just doesn't make sense. For the builder discussion: Does the current proposal work for the {{Filter(Limit(Filter(Scan)))}} case? I don't think it does because implementations expect predicates to be pushed just once. I think that these cases should probably be handled by a more generic push-down that passes parts of the plan tree. If that's the case, then the builder is simpler and immutable. I'd rather go with the builder for now to set the expectation that ScanConfig is immutable. We *could* add an intermediate class, like {{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown interfaces. Then, we could add a method to {{ConfigurableScanConfig}} to get a final, immutable {{ScanConfig}}. But, that's really just a builder with a more difficult to implement interface. In the event that we do find pushdown operations that are incompatible with a builder – and not incompatible with both a builder and the current proposal like your example – then we can always add a new way to build or configure a {{ScanConfig}} later. The important thing is that the ScanConfig is immutable to provide the guarantees that you want in this API: that the ScanConfig won't change between calls to {{estimateStatistics}} and {{planInputPartitions}}. was (Author: rdblue): [~cloud_fan], when you say that "ReadSupport is created via reflection and should be very light-weight", that raises a red flag for me. As I've mentioned in the discussions on table catalog support, I think that we should instantiate the catalog through reflection, not the data source. Starting with the data source is backward and only works now because we have just one global catalog. In my catalog support PR, the catalog is what gets instantiated via reflection. In that model, the initialization you need would be in the Table instance's life-cycle and the catalog would return tables that implement ReadSupport. So, I think we should consider the impact of multiple catalog support. I think that if we go with the catalog proposal, it would make this change cleaner because we would be able to remove the extra DataSourceReader level. 
This would also end up working like a lot of other Table and Scan interfaces. Iceberg, for example, has a Table that you call newScan on to get a configurable TableScan. That's very similar to how this API would end up: Table implements ReadSupport, ReadSupport exposes newScanConfig. Getting a Reader that doesn't really read data just doesn't make sense. For the builder discussion: Does the current proposal work for the {{Filter(Limit(Filter(Scan)))}} case? I don't think it does because implementations expect predicates to be pushed just once. I think that these cases should probably be handled by a more generic push-down that passes parts of the plan tree. If that's the case, then the builder is simpler and immutable. I'd rather go with the builder for now to set the expectation that ScanConfig is immutable. We *could* add an intermediate class, like {{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown interfaces. Then, we could add a method to {{ConfigurableScanConfig}} to get a final, immutable {{ScanConfig}}. But, that's really just a builder with a more difficult to implement interface. In the event that we do find pushdown operations that are incompatible with a builder – and not incompatible with both a builder and the current proposal like your example – then we can always add a new way to b
[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562058#comment-16562058 ] Ryan Blue commented on SPARK-24882: --- [~cloud_fan], when you say that "ReadSupport is created via reflection and should be very light-weight", that raises a red flag for me. As I've mentioned in the discussions on table catalog support, I think that we should instantiate the catalog through reflection, not the data source. Starting with the data source is backward and only works now because we have just one global catalog. In my catalog support PR, the catalog is what gets instantiated via reflection. In that model, the initialization you need would be in the Table instance's life-cycle and the catalog would return tables that implement ReadSupport. So, I think we should consider the impact of multiple catalog support. I think that if we go with the catalog proposal, it would make this change cleaner because we would be able to remove the extra DataSourceReader level. This would also end up working like a lot of other Table and Scan interfaces. Iceberg, for example, has a Table that you call newScan on to get a configurable TableScan. That's very similar to how this API would end up: Table implements ReadSupport, ReadSupport exposes newScanConfig. Getting a Reader that doesn't really read data just doesn't make sense. For the builder discussion: Does the current proposal work for the {{Filter(Limit(Filter(Scan)))}} case? I don't think it does because implementations expect predicates to be pushed just once. I think that these cases should probably be handled by a more generic push-down that passes parts of the plan tree. If that's the case, then the builder is simpler and immutable. I'd rather go with the builder for now to set the expectation that ScanConfig is immutable. We *could* add an intermediate class, like {{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown interfaces. Then, we could add a method to {{ConfigurableScanConfig}} to get a final, immutable {{ScanConfig}}. But, that's really just a builder with a more difficult to implement interface. In the event that we do find pushdown operations that are incompatible with a builder – and not incompatible with both a builder and the current proposal like your example – then we can always add a new way to build or configure a {{ScanConfig}} later. The important thing is that the ScanConfig is immutable to provide the guarantees that you want in this API: that the ScanConfig won't change between calls to {{estimateStatistics}} and {{planInputPartitions}}. > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. 
Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
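As a purely illustrative aid to the discussion above (placeholder names only, not the real data source v2 API), the shape being debated is roughly a mutable builder that absorbs pushdown and then produces an immutable ScanConfig reused by the later planning calls.
{code:scala}
// Placeholder interfaces sketching the mutable-builder / immutable-config
// split; these are not the org.apache.spark.sql.sources.v2 interfaces.
trait ScanConfig                                  // immutable snapshot of pushdown results

trait ScanConfigBuilder {
  def build(): ScanConfig
}

trait SupportsPushDownRequiredColumns extends ScanConfigBuilder {
  def pruneColumns(requiredColumns: Seq[String]): Unit   // mutates the builder only
}

trait ReadSupport {
  def newScanConfigBuilder(): ScanConfigBuilder
  // Both calls see the same immutable config, so statistics and planning agree.
  def estimateStatistics(config: ScanConfig): Long
  def planInputPartitions(config: ScanConfig): Seq[AnyRef]
}
{code}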
[jira] [Assigned] (SPARK-24933) SinkProgress should report written rows
[ https://issues.apache.org/jira/browse/SPARK-24933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24933: Assignee: (was: Apache Spark) > SinkProgress should report written rows > --- > > Key: SPARK-24933 > URL: https://issues.apache.org/jira/browse/SPARK-24933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Vaclav Kosar >Priority: Major > > SinkProgress should report similar properties like SourceProgress as long as > they are available for given Sink. Count of written rows is metric availble > for all Sinks. Since relevant progress information is with respect to > commited rows, ideal object to carry this info is WriterCommitMessage. For > brevity the implementation will focus only on Sinks with API V2 and on Micro > Batch mode. Implemention for Continuous mode will be provided at later date. > h4. Before > {code} > {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"} > {code} > h4. After > {code} > {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24933) SinkProgress should report written rows
[ https://issues.apache.org/jira/browse/SPARK-24933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562044#comment-16562044 ] Apache Spark commented on SPARK-24933: -- User 'vackosar' has created a pull request for this issue: https://github.com/apache/spark/pull/21919 > SinkProgress should report written rows > --- > > Key: SPARK-24933 > URL: https://issues.apache.org/jira/browse/SPARK-24933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Vaclav Kosar >Priority: Major > > SinkProgress should report similar properties like SourceProgress as long as > they are available for given Sink. Count of written rows is metric availble > for all Sinks. Since relevant progress information is with respect to > commited rows, ideal object to carry this info is WriterCommitMessage. For > brevity the implementation will focus only on Sinks with API V2 and on Micro > Batch mode. Implemention for Continuous mode will be provided at later date. > h4. Before > {code} > {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"} > {code} > h4. After > {code} > {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24933) SinkProgress should report written rows
[ https://issues.apache.org/jira/browse/SPARK-24933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24933: Assignee: Apache Spark > SinkProgress should report written rows > --- > > Key: SPARK-24933 > URL: https://issues.apache.org/jira/browse/SPARK-24933 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Vaclav Kosar >Assignee: Apache Spark >Priority: Major > > SinkProgress should report similar properties like SourceProgress as long as > they are available for given Sink. Count of written rows is metric availble > for all Sinks. Since relevant progress information is with respect to > commited rows, ideal object to carry this info is WriterCommitMessage. For > brevity the implementation will focus only on Sinks with API V2 and on Micro > Batch mode. Implemention for Continuous mode will be provided at later date. > h4. Before > {code} > {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"} > {code} > h4. After > {code} > {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
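For context, a short sketch of where the proposed numOutputRows would show up, assuming query is a running StreamingQuery; today the sink section carries only the description, as in the "Before" example above.
{code:scala}
// Inspect the most recent progress of a running streaming query.
val progress = query.lastProgress
println(progress.prettyJson)
// The "sink" section currently contains only "description"; once this ticket
// is implemented it would also carry "numOutputRows", as in the After example.
{code}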
[jira] [Assigned] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24821: Assignee: (was: Apache Spark) > Fail fast when submitted job compute on a subset of all the partitions for a > barrier stage > -- > > Key: SPARK-24821 > URL: https://issues.apache.org/jira/browse/SPARK-24821 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Detect SparkContext.runJob() launch a barrier stage with a subset of all the > partitions, one example is the `first()` operation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562031#comment-16562031 ] Apache Spark commented on SPARK-24821: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/21918 > Fail fast when submitted job compute on a subset of all the partitions for a > barrier stage > -- > > Key: SPARK-24821 > URL: https://issues.apache.org/jira/browse/SPARK-24821 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Detect SparkContext.runJob() launch a barrier stage with a subset of all the > partitions, one example is the `first()` operation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24821: Assignee: Apache Spark > Fail fast when submitted job compute on a subset of all the partitions for a > barrier stage > -- > > Key: SPARK-24821 > URL: https://issues.apache.org/jira/browse/SPARK-24821 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Major > > Detect when SparkContext.runJob() launches a barrier stage with a subset of all the > partitions; one example is the `first()` operation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
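For illustration (the RDD contents and partition count are arbitrary, and sc is assumed to be a spark-shell SparkContext), this is the pattern the change targets: first() submits a job over a single partition, which cannot satisfy a barrier stage and should now be rejected at submission time rather than hang.
{code:scala}
// A barrier stage over 4 partitions.
val doubled = sc.parallelize(1 to 100, 4)
  .barrier()
  .mapPartitions(iter => iter.map(_ * 2))

// first() internally runs a job on just one partition; with this change the
// scheduler is expected to fail fast instead of waiting for barrier tasks
// that will never all be launched together.
// doubled.first()
{code}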
[jira] [Assigned] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true
[ https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24720: Assignee: Apache Spark > kafka transaction creates Non-consecutive Offsets (due to transaction offset) > making streaming fail when failOnDataLoss=true > > > Key: SPARK-24720 > URL: https://issues.apache.org/jira/browse/SPARK-24720 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Assignee: Apache Spark >Priority: Major > > When kafka transactions are used, sending 1 message to kafka will result in 1 > offset for the data + 1 offset to mark the transaction. > When the kafka connector for spark streaming reads a topic with non-consecutive > offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted > topics. > However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 > message + 1 transaction commit are in a partition, spark will try to read > offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 > won't return a value and buffer.hasNext() will be false even after a poll, > since no data is present for offset 1 (it's the transaction commit). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true
[ https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562000#comment-16562000 ] Apache Spark commented on SPARK-24720: -- User 'QuentinAmbard' has created a pull request for this issue: https://github.com/apache/spark/pull/21917 > kafka transaction creates Non-consecutive Offsets (due to transaction offset) > making streaming fail when failOnDataLoss=true > > > Key: SPARK-24720 > URL: https://issues.apache.org/jira/browse/SPARK-24720 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Priority: Major > > When kafka transactions are used, sending 1 message to kafka will result in 1 > offset for the data + 1 offset to mark the transaction. > When the kafka connector for spark streaming reads a topic with non-consecutive > offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted > topics. > However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 > message + 1 transaction commit are in a partition, spark will try to read > offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 > won't return a value and buffer.hasNext() will be false even after a poll, > since no data is present for offset 1 (it's the transaction commit). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true
[ https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24720: Assignee: (was: Apache Spark) > kafka transaction creates Non-consecutive Offsets (due to transaction offset) > making streaming fail when failOnDataLoss=true > > > Key: SPARK-24720 > URL: https://issues.apache.org/jira/browse/SPARK-24720 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Priority: Major > > When kafka transactions are used, sending 1 message to kafka will result in 1 > offset for the data + 1 offset to mark the transaction. > When the kafka connector for spark streaming reads a topic with non-consecutive > offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted > topics. > However, SPARK-17147 doesn't fix this issue for kafka transactions: if 1 > message + 1 transaction commit are in a partition, spark will try to read > offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 > won't return a value and buffer.hasNext() will be false even after a poll, > since no data is present for offset 1 (it's the transaction commit). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
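To make the offset gap concrete, a standalone Kafka producer sketch (the broker address, topic, and transactional.id are placeholders): a single transactional record occupies one offset and the commit control record occupies the next, which is exactly the non-consecutive range described above.
{code:scala}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("transactional.id", "demo-tx")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
producer.beginTransaction()
// The record itself lands at offset N of the partition...
producer.send(new ProducerRecord[String, String]("topic", "key", "value"))
// ...and the commit writes a control record at offset N + 1, which consumers
// never see as data.
producer.commitTransaction()
producer.close()
{code}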
[jira] [Assigned] (SPARK-24958) Report executors' process tree total memory information to heartbeat signals
[ https://issues.apache.org/jira/browse/SPARK-24958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24958: Assignee: Apache Spark > Report executors' process tree total memory information to heartbeat signals > > > Key: SPARK-24958 > URL: https://issues.apache.org/jira/browse/SPARK-24958 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Reza Safi >Assignee: Apache Spark >Priority: Major > > Spark executors' process tree total memory information can be really useful. > Currently such information is not available. The goal of this task is to > compute this information for each executor, add it to the > heartbeat signals, and compute the peaks at the driver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24958) Report executors' process tree total memory information to heartbeat signals
[ https://issues.apache.org/jira/browse/SPARK-24958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561990#comment-16561990 ] Apache Spark commented on SPARK-24958: -- User 'rezasafi' has created a pull request for this issue: https://github.com/apache/spark/pull/21916 > Report executors' process tree total memory information to heartbeat signals > > > Key: SPARK-24958 > URL: https://issues.apache.org/jira/browse/SPARK-24958 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Reza Safi >Priority: Major > > Spark executors' process tree total memory information can be really useful. > Currently such information is not available. The goal of this task is to > compute this information for each executor, add it to the > heartbeat signals, and compute the peaks at the driver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24958) Report executors' process tree total memory information to heartbeat signals
[ https://issues.apache.org/jira/browse/SPARK-24958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24958: Assignee: (was: Apache Spark) > Report executors' process tree total memory information to heartbeat signals > > > Key: SPARK-24958 > URL: https://issues.apache.org/jira/browse/SPARK-24958 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Reza Safi >Priority: Major > > Spark executors' process tree total memory information can be really useful. > Currently such information is not available. The goal of this task is to > compute this information for each executor, add it to the > heartbeat signals, and compute the peaks at the driver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
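A rough, Linux-only sketch of the kind of measurement involved (the helper names are made up and this is not the patch's actual implementation): sum VmRSS for an executor process and all of its descendants by walking /proc, which is the figure a heartbeat could then carry.
{code:scala}
import scala.io.Source
import scala.sys.process._

// Resident set size of one process, in kB, read from /proc/<pid>/status.
def rssKb(pid: Int): Long = {
  val src = Source.fromFile(s"/proc/$pid/status")
  try src.getLines().find(_.startsWith("VmRSS:")).map(_.split("\\s+")(1).toLong).getOrElse(0L)
  finally src.close()
}

// Direct children of a process, via `pgrep -P`.
def childPids(pid: Int): Seq[Int] =
  Seq("pgrep", "-P", pid.toString).lineStream_!.map(_.trim.toInt)

// Total for the whole process tree rooted at pid.
def processTreeRssKb(pid: Int): Long =
  rssKb(pid) + childPids(pid).map(processTreeRssKb).sum
{code}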
[jira] [Resolved] (SPARK-24582) Design: Barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-24582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo resolved SPARK-24582. -- Resolution: Fixed > Design: Barrier execution mode > -- > > Key: SPARK-24582 > URL: https://issues.apache.org/jira/browse/SPARK-24582 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > [~jiangxb1987] and [~cloud_fan] outlined a design sketch in SPARK-24375, > which covers some basic scenarios. This story is for a formal design of the > barrier execution mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24581) Design: BarrierTaskContext.barrier()
[ https://issues.apache.org/jira/browse/SPARK-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo resolved SPARK-24581. -- Resolution: Fixed > Design: BarrierTaskContext.barrier() > > > Key: SPARK-24581 > URL: https://issues.apache.org/jira/browse/SPARK-24581 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Jiang Xingbo >Priority: Major > > We need to provide a communication barrier function to users to help > coordinate tasks within a barrier stage. This is very similar to the MPI_Barrier > function in MPI. This story is for its design. > > Requirements: > * Low-latency. The tasks should be unblocked soon after all tasks have > reached this barrier. The latency is more important than CPU cycles here. > * Support unlimited timeout with proper logging. For DL tasks, it might take > very long to converge, so we should support unlimited timeout with proper > logging, so that users know why a task is waiting. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
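A minimal usage sketch of the API this story designs, assuming a spark-shell SparkContext named sc; the per-task setup is only a placeholder comment.
{code:scala}
import org.apache.spark.BarrierTaskContext

val out = sc.parallelize(1 to 100, 4)
  .barrier()
  .mapPartitions { iter =>
    val ctx = BarrierTaskContext.get()
    // ... per-task setup here, e.g. starting a local worker process ...
    ctx.barrier()   // blocks until every task in the stage reaches this point
    iter
  }
  .collect()
{code}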
[jira] [Resolved] (SPARK-22814) JDBC support date/timestamp type as partitionColumn
[ https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22814. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > JDBC support date/timestamp type as partitionColumn > --- > > Key: SPARK-22814 > URL: https://issues.apache.org/jira/browse/SPARK-22814 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.2.1 >Reporter: Yuechen Chen >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > In spark, you can partition MySQL queries by partitionColumn. > val df = (spark.read.jdbc(url=jdbcUrl, > table="employees", > columnName="emp_no", > lowerBound=1L, > upperBound=10L, > numPartitions=100, > connectionProperties=connectionProperties)) > display(df) > But, partitionColumn must be a numeric column from the table. > However, there are lots of tables which have no primary key but do have > date/timestamp indexes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
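A hedged sketch of what the improvement enables (the URL, table, column, and bound values are made-up examples, and spark is an existing SparkSession): partitioning a JDBC read on a timestamp column instead of a numeric one.
{code:scala}
// Hypothetical read split into 12 partitions by a TIMESTAMP column.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/shop")   // placeholder URL
  .option("dbtable", "orders")
  .option("partitionColumn", "created_at")        // a TIMESTAMP column
  .option("lowerBound", "2018-01-01 00:00:00")
  .option("upperBound", "2018-12-31 23:59:59")
  .option("numPartitions", "12")
  .load()
{code}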
[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24954: Assignee: (was: Apache Spark) > Fail fast on job submit if run a barrier stage with dynamic resource > allocation enabled > --- > > Key: SPARK-24954 > URL: https://issues.apache.org/jira/browse/SPARK-24954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Blocker > > Since we explicitly listed "Support running barrier stage with dynamic > resource allocation" as a Non-Goal in the design doc, we shall fail fast on job > submit if running a barrier stage with dynamic resource allocation enabled, > to avoid some confusing behaviors (refer to SPARK-24942 for some > examples). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561985#comment-16561985 ] Apache Spark commented on SPARK-24954: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/21915 > Fail fast on job submit if run a barrier stage with dynamic resource > allocation enabled > --- > > Key: SPARK-24954 > URL: https://issues.apache.org/jira/browse/SPARK-24954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Blocker > > Since we explicitly listed "Support running barrier stage with dynamic > resource allocation" as a Non-Goal in the design doc, we shall fail fast on job > submit if running a barrier stage with dynamic resource allocation enabled, > to avoid some confusing behaviors (refer to SPARK-24942 for some > examples). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24954: Assignee: Apache Spark > Fail fast on job submit if run a barrier stage with dynamic resource > allocation enabled > --- > > Key: SPARK-24954 > URL: https://issues.apache.org/jira/browse/SPARK-24954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Blocker > > Since we explicitly listed "Support running barrier stage with dynamic > resource allocation" as a Non-Goal in the design doc, we shall fail fast on job > submit if running a barrier stage with dynamic resource allocation enabled, > to avoid some confusing behaviors (refer to SPARK-24942 for some > examples). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24771. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.4.0 > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition
[ https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561976#comment-16561976 ] Apache Spark commented on SPARK-24965: -- User 'krisgeus' has created a pull request for this issue: https://github.com/apache/spark/pull/21893 > Spark SQL fails when reading a partitioned hive table with different formats > per partition > -- > > Key: SPARK-24965 > URL: https://issues.apache.org/jira/browse/SPARK-24965 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Kris Geusebroek >Priority: Major > Labels: pull-request-available > > When a hive parquet partitioned table contains a partition with a different > format (avro, for example), select * fails with a read exception (the avro file > is not a parquet file). > Selecting in hive works as expected. > To support this, a new sql syntax also needed to be supported: > * ALTER TABLE SET FILEFORMAT > This is included in the same PR since the unit test needs it to set up the > test data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition
[ https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24965: Assignee: Apache Spark > Spark SQL fails when reading a partitioned hive table with different formats > per partition > -- > > Key: SPARK-24965 > URL: https://issues.apache.org/jira/browse/SPARK-24965 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Kris Geusebroek >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > When a hive parquet partitioned table contains a partition with a different > format (avro, for example), select * fails with a read exception (the avro file > is not a parquet file). > Selecting in hive works as expected. > To support this, a new sql syntax also needed to be supported: > * ALTER TABLE SET FILEFORMAT > This is included in the same PR since the unit test needs it to set up the > test data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition
[ https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24965: Assignee: (was: Apache Spark) > Spark SQL fails when reading a partitioned hive table with different formats > per partition > -- > > Key: SPARK-24965 > URL: https://issues.apache.org/jira/browse/SPARK-24965 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Kris Geusebroek >Priority: Major > Labels: pull-request-available > > When a hive parquet partitioned table contains a partition with a different > format (avro, for example), select * fails with a read exception (the avro file > is not a parquet file). > Selecting in hive works as expected. > To support this, a new sql syntax also needed to be supported: > * ALTER TABLE SET FILEFORMAT > This is included in the same PR since the unit test needs it to set up the > test data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
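A sketch of the failing scenario and the DDL mentioned in the description, using spark.sql with invented table and partition names (assumes a SparkSession with Hive support enabled).
{code:scala}
// A Parquet-partitioned table where one partition is switched to Avro.
spark.sql("CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING) STORED AS PARQUET")
spark.sql("ALTER TABLE logs ADD PARTITION (dt = '2018-07-30')")

// The syntax the PR adds support for, used here to flip one partition's format.
spark.sql("ALTER TABLE logs PARTITION (dt = '2018-07-30') SET FILEFORMAT AVRO")

// Before the fix, reading the mixed-format table could fail with a Parquet
// read error on the Avro partition, while the same query works in Hive.
spark.sql("SELECT * FROM logs").show()
{code}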
[jira] [Commented] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561944#comment-16561944 ] Thomas Graves commented on SPARK-24934: --- what is the real affected versions here? Since it went into spark 2.3.2 does it affect 2.3.1? > Complex type and binary type in in-memory partition pruning does not work due > to missing upper/lower bounds cases > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException
[ https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bruce_zhao updated SPARK-24970: --- Description: We're using Spark streaming to consume Kinesis data, and found that it reads more data from Kinesis and is easy to touch ProvisionedThroughputExceededException *when it recovers from streaming checkpoint*. Normally, it's a WARN in spark log. But when we have multiple streaming applications (i.e., 5 applications) to consume the same Kinesis stream, the situation becomes serious. *The application will fail to recover due to the following exception in driver.* And one application failure will also affect the other running applications. {panel:title=Exception} org.apache.spark.SparkException: Job aborted due to stage failure: {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): org.apache.spark.SparkException: Gave up after 3 retries while getting records using shard iterator, last exception: at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Caused by: *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId- in stream rellfsstream-an under account 1{color}.* (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: d3520677-060e-14c4-8014-2886b6b75f03) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219) at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195) at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004) at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269) {panel} After check the source code, we found it calls getBlockFromKinesis() to recover data and in this function it accesses Kinesis directly to read data. As all partitions in the BlockRDD will access Kinesis, and AWS Kinesis only supports 5 concurrency reads per shard per second, it will touch ProvisionedThroughputExceededException easily. Even the code does some retries, it's still easy to fail when con
[jira] [Assigned] (SPARK-24967) Use internal.Logging instead for logging
[ https://issues.apache.org/jira/browse/SPARK-24967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24967: --- Assignee: Hyukjin Kwon > Use internal.Logging instead for logging > > > Key: SPARK-24967 > URL: https://issues.apache.org/jira/browse/SPARK-24967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.4.0 > > > It looks like AvroFileFormat directly uses {{getLogger}}. It should use > {{internal.Logging}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24967) Use internal.Logging instead for logging
[ https://issues.apache.org/jira/browse/SPARK-24967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24967. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21914 [https://github.com/apache/spark/pull/21914] > Use internal.Logging instead for logging > > > Key: SPARK-24967 > URL: https://issues.apache.org/jira/browse/SPARK-24967 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.4.0 > > > It looks like AvroFileFormat directly uses {{getLogger}}. It should use > {{internal.Logging}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
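A minimal sketch of the change described, as it would look inside Spark's own source tree (the class name is invented): mix in org.apache.spark.internal.Logging instead of fetching a logger directly.
{code:scala}
package org.apache.spark.sql.avro

import org.apache.spark.internal.Logging

// Instead of calling a logging framework's getLogger directly, Spark-internal
// classes mix in the shared Logging trait and use its log helpers.
private[avro] class AvroFooterReader extends Logging {   // invented class name
  def read(path: String): Unit = {
    logInfo(s"Reading Avro metadata from $path")
    // ...
  }
}
{code}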
[jira] [Resolved] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen
[ https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24957. - Resolution: Fixed Fix Version/s: 2.3.2 2.4.0 Issue resolved by pull request 21910 [https://github.com/apache/spark/pull/21910] > Decimal arithmetic can lead to wrong values using codegen > - > > Key: SPARK-24957 > URL: https://issues.apache.org/jira/browse/SPARK-24957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.4.0, 2.3.2 > > > I noticed a bug when doing arithmetic on a dataframe containing decimal > values with codegen enabled. > I tried to narrow it down on a small repro and got this (executed in > spark-shell): > {noformat} > scala> val df = Seq( > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("11.88")) > | ).toDF("text", "number") > df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)] > scala> val df_grouped_1 = > df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number")) > df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> df_grouped_1.collect() > res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143]) > scala> val df_grouped_2 = > df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> df_grouped_2.collect() > res1: Array[org.apache.spark.sql.Row] = > Array([a,11948571.4285714285714285714286]) > scala> val df_total_sum = > df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)] > scala> df_total_sum.collect() > res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143]) > {noformat} > The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the > result of {{df_grouped_2}} is clearly incorrect (it is the value of the > correct result times {{10^14}}). > When codegen is disabled all results are correct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen
[ https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24957: --- Assignee: Marco Gaido > Decimal arithmetic can lead to wrong values using codegen > - > > Key: SPARK-24957 > URL: https://issues.apache.org/jira/browse/SPARK-24957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Assignee: Marco Gaido >Priority: Major > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > > I noticed a bug when doing arithmetic on a dataframe containing decimal > values with codegen enabled. > I tried to narrow it down on a small repro and got this (executed in > spark-shell): > {noformat} > scala> val df = Seq( > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("11.88")) > | ).toDF("text", "number") > df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)] > scala> val df_grouped_1 = > df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number")) > df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> df_grouped_1.collect() > res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143]) > scala> val df_grouped_2 = > df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> df_grouped_2.collect() > res1: Array[org.apache.spark.sql.Row] = > Array([a,11948571.4285714285714285714286]) > scala> val df_total_sum = > df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)] > scala> df_total_sum.collect() > res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143]) > {noformat} > The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the > result of {{df_grouped_2}} is clearly incorrect (it is the value of the > correct result times {{10^14}}). > When codegen is disabled all results are correct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen
[ https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24957: Labels: correctness (was: ) > Decimal arithmetic can lead to wrong values using codegen > - > > Key: SPARK-24957 > URL: https://issues.apache.org/jira/browse/SPARK-24957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: David Vogelbacher >Priority: Major > Labels: correctness > > I noticed a bug when doing arithmetic on a dataframe containing decimal > values with codegen enabled. > I tried to narrow it down on a small repro and got this (executed in > spark-shell): > {noformat} > scala> val df = Seq( > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("12.0")), > | ("a", BigDecimal("11.88")), > | ("a", BigDecimal("11.88")) > | ).toDF("text", "number") > df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)] > scala> val df_grouped_1 = > df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number")) > df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> df_grouped_1.collect() > res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143]) > scala> val df_grouped_2 = > df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: > decimal(38,22)] > scala> df_grouped_2.collect() > res1: Array[org.apache.spark.sql.Row] = > Array([a,11948571.4285714285714285714286]) > scala> val df_total_sum = > df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number")) > df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)] > scala> df_total_sum.collect() > res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143]) > {noformat} > The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the > result of {{df_grouped_2}} is clearly incorrect (it is the value of the > correct result times {{10^14}}). > When codegen is disabled all results are correct. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561798#comment-16561798 ] Jackey Lee edited comment on SPARK-24630 at 7/30/18 11:38 AM: -- For Stream Table DDL,we have a better way to deal with(such as the following example). Firstly, we used a more compatible way to handle Stream Table, user no need to learn other more syntax; Secondly, the Stream Table, kafka_sql_test, can be used to not only Source but also Sink. Such as, user A can write Stream Data into kafka_sql_test, and user B can read from kafka_sql_test. This is a closed loop in data processing, and users can run the stream processing of the entire business smoothly through SQLStreaming. Thirdly, the Stream Table created by this sql just seems like Hive Table. User even can run batch queries on this Stream Table. Finally, using this way to deal with STREAM Table, fewer code will be added or changed in spark source. {code:sql} create table kafka_sql_test using kafka options( table.isstreaming = 'true', subscribe = 'topic', kafka.bootstrap.servers = 'localhost:9092') {code} was (Author: jackey lee): For Stream Table DDL,we have a better way to deal with(such as the following example). Firstly, we used a more compatible way to handle Stream Table, user no need to learn other more syntax; Secondly, the Stream Table, kafka_sql_test, can be used to not only Source but also Sink. Such as, user A can write Stream Data into kafka_sql_test, and user B can read from kafka_sql_test. This is a closed loop in data processing, and users can run the stream processing of the entire business smoothly through SQLStreaming. Finally, using this way to deal with STREAM Table, fewer code will be added or changed in spark source. {code:sql} create table kafka_sql_test using kafka options( table.isstreaming = 'true', subscribe = 'topic', kafka.bootstrap.servers = 'localhost:9092') {code} > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561798#comment-16561798 ] Jackey Lee commented on SPARK-24630: For Stream Table DDL,we have a better way to deal with(such as the following example). Firstly, we used a more compatible way to handle Stream Table, user no need to learn other more syntax; Secondly, the Stream Table, kafka_sql_test, can be used to not only Source but also Sink. Such as, user A can write Stream Data into kafka_sql_test, and user B can read from kafka_sql_test. This is a closed loop in data processing, and users can run the stream processing of the entire business smoothly through SQLStreaming. Finally, using this way to deal with STREAM Table, fewer code will be added or changed in spark source. {code:sql} create table kafka_sql_test using kafka options( table.isstreaming = 'true', subscribe = 'topic', kafka.bootstrap.servers = 'localhost:9092') {code} > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561762#comment-16561762 ] Genmao Yu edited comment on SPARK-24630 at 7/30/18 11:27 AM: - Practice to add the StreamSQL DDL, like this: {code:java} spark-sql> create source stream service_logs with ("type"="kafka","bootstrap.servers"="x.x.x.x:9092", "topic"="service_log"); Time taken: 1.833 seconds spark-sql> desc service_logs; key binary NULL value binary NULL topic string NULL partition int NULL offset bigint NULL timestamp timestamp NULL timestampType int NULL Time taken: 0.394 seconds, Fetched 7 row(s) spark-sql> create sink stream analysis with ("type"="kafka", "outputMode"="complete", "bootstrap.servers"="x.x.x.x:9092", "topic"="analysis", "checkpointLocation"="hdfs:///tmp/cp0"); Time taken: 0.027 seconds spark-sql> insert into analysis select key, value, count(*) from service_logs group by key, value; Time taken: 0.355 seconds {code} was (Author: unclegen): Try to add the StreamSQL DDL, like this: {code:java} spark-sql> create source stream service_logs with ("type"="kafka","bootstrap.servers"="x.x.x.x:9092", "topic"="service_log"); Time taken: 1.833 seconds spark-sql> desc service_logs; key binary NULL value binary NULL topic string NULL partition int NULL offset bigint NULL timestamp timestamp NULL timestampType int NULL Time taken: 0.394 seconds, Fetched 7 row(s) spark-sql> create sink stream analysis with ("type"="kafka", "outputMode"="complete", "bootstrap.servers"="x.x.x.x:9092", "topic"="analysis", "checkpointLocation"="hdfs:///tmp/cp0"); Time taken: 0.027 seconds spark-sql> insert into analysis select key, value, count(*) from service_logs group by key, value; Time taken: 0.355 seconds {code} > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24946) PySpark - Allow np.Arrays and pd.Series in df.approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-24946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561784#comment-16561784 ] Paul Westenthanner commented on SPARK-24946: Yes, I agree that it's more sugar than necessity; however, I'd be happy to implement it. I agree that we should just check whether it's iterable, and since a list is passed to py4j we might just do something like {{percentiles_list = [is_valid(v) for v in input_iterable]}}, where is_valid checks that all values are floats between 0 and 1. If you agree that this could be useful I'll be happy to create a pull request > PySpark - Allow np.Arrays and pd.Series in df.approxQuantile > > > Key: SPARK-24946 > URL: https://issues.apache.org/jira/browse/SPARK-24946 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Paul Westenthanner >Priority: Minor > Labels: DataFrame, beginner, pyspark > > As a Python user it is convenient to pass a numpy array or pandas series to > `{{approxQuantile}}(_col_, _probabilities_, _relativeError_)` for the > probabilities parameter. > > Especially for creating cumulative plots (say in 1% steps) it is handy to use > `approxQuantile(col, np.arange(0, 1.0, 0.01), relativeError)`. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException
[ https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bruce_zhao updated SPARK-24970: --- Description: We're using Spark streaming to consume Kinesis data, and found that it reads more data from Kinesis and is easy to touch ProvisionedThroughputExceededException *when it recovers from streaming checkpoint*. Normally, it's a WARN in spark log. But when we have multiple streaming applications (i.e., 5 applications) to consume the same Kinesis stream, the situation becomes serious. *The application will fail to recover due to the following exception in driver.* And one application failure will also affect the other running applications. {panel:title=Exception} org.apache.spark.SparkException: Job aborted due to stage failure: {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): org.apache.spark.SparkException: Gave up after 3 retries while getting records using shard iterator, last exception: at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Caused by: *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId- in stream rellfsstream-an under account 1{color}.* (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: d3520677-060e-14c4-8014-2886b6b75f03) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219) at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195) at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004) at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269) {panel} After check the source code, we found it calls getBlockFromKinesis() to recover data and in this function it accesses Kinesis directly to read data. As all partitions in the BlockRDD will access Kinesis, and AWS Kinesis only supports 5 concurrency reads per shard per second, it will touch ProvisionedThroughputExceededException easily. Even the code does some retries, it's still easy to fail when con
[jira] [Updated] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException
[ https://issues.apache.org/jira/browse/SPARK-24970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bruce_zhao updated SPARK-24970: --- Description: We're using Spark streaming to consume Kinesis data, and found that it reads more data from Kinesis and is easy to touch ProvisionedThroughputExceededException *when it recovers from streaming checkpoint*. Normally, it's a WARN in spark log. But when we have multiple streaming applications (i.e., 5 applications) to consume the same Kinesis stream, the situation becomes serious. *The application will fail to recover due to the following exception in driver.* And one application failure will also affect the other running applications. {panel:title=Exception} org.apache.spark.SparkException: Job aborted due to stage failure: {color:#ff}*Task 5 in stage 7.0 failed 4 times, most recent failure*:{color} Lost task 5.3 in stage 7.0 (TID 128, ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): org.apache.spark.SparkException: Gave up after 3 retries while getting records using shard iterator, last exception: at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Caused by: *{color:#ff}com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId- in stream rellfsstream-an under account 1{color}.* (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: d3520677-060e-14c4-8014-2886b6b75f03) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511) at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219) at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195) at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004) at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224) at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269) {panel} After check the source code, we found it calls getBlockFromKinesis() to recover data and in this function it accesses Kinesis directly to read data. As all partitions in the BlockRDD will access Kinesis, and AWS Kinesis only supports 5 concurrency reads per shard per second, it will touch ProvisionedThroughputExceededException easily. Even the code does some retries, it's still easy to fail whe
[jira] [Created] (SPARK-24970) Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException
bruce_zhao created SPARK-24970: -- Summary: Spark Kinesis streaming application fails to recover from streaming checkpoint due to ProvisionedThroughputExceededException Key: SPARK-24970 URL: https://issues.apache.org/jira/browse/SPARK-24970 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.2.0 Reporter: bruce_zhao

We're using Spark Streaming to consume Kinesis data, and we found that it reads more data from Kinesis and easily hits ProvisionedThroughputExceededException *when it recovers from a streaming checkpoint*. Normally this is only a WARN in the Spark log. But when we have multiple streaming applications (e.g., 5 applications) consuming the same Kinesis stream, the situation becomes serious. *The application will fail to recover due to the following exception in the driver*, and one application's failure will also affect the other running applications.

{panel:title=Exception}
org.apache.spark.SparkException: Job aborted due to stage failure: *Task 5 in stage 7.0 failed 4 times, most recent failure*: Lost task 5.3 in stage 7.0 (TID 128, ip-172-31-14-36.ap-northeast-1.compute.internal, executor 1): org.apache.spark.SparkException: Gave up after 3 retries while getting records using shard iterator, last exception:
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecordsAndNextKinesisIterator(KinesisBackedBlockRDD.scala:223)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:207)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
 at scala.collection.Iterator$class.foreach(Iterator.scala:893)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
 at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
 at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
 at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: *com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId- in stream rellfsstream-an under account 1.* (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: d3520677-060e-14c4-8014-2886b6b75f03)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
 at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511)
 at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2219)
 at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2195)
 at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1004)
 at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:980)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$2.apply(KinesisBackedBlockRDD.scala:224)
 at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269)
{panel}

After checking the source code, we found that it calls getBlockFromKinesis() to recover the data, and in this function it accesses Kinesis directly to read d
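The retry the stack trace points at is KinesisSequenceRangeIterator.retryOrTimeout, which gives up after only a few attempts and so fails the recovery task once the shard is throttled. The sketch below is not Spark's implementation; it is a hypothetical consumer-side workaround showing how a throttled GetRecords call could be retried with a capped exponential backoff (helper name, attempt count, and wait times are illustrative assumptions).
{code:scala}
import com.amazonaws.services.kinesis.AmazonKinesis
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetRecordsResult, ProvisionedThroughputExceededException}

// Hypothetical sketch (not Spark's KinesisBackedBlockRDD code): keep retrying a
// throttled GetRecords with exponential backoff instead of giving up after a
// fixed, small number of attempts.
def getRecordsWithBackoff(
    client: AmazonKinesis,
    shardIterator: String,
    maxAttempts: Int = 8,
    initialWaitMs: Long = 100L): GetRecordsResult = {
  var attempt = 0
  var waitMs = initialWaitMs
  var result: Option[GetRecordsResult] = None
  while (result.isEmpty) {
    try {
      result = Some(client.getRecords(
        new GetRecordsRequest().withShardIterator(shardIterator)))
    } catch {
      // Once maxAttempts is exhausted the guard fails and the exception propagates.
      case _: ProvisionedThroughputExceededException if attempt < maxAttempts - 1 =>
        attempt += 1
        Thread.sleep(waitMs)                  // back off before the next attempt
        waitMs = math.min(waitMs * 2, 10000L) // cap the backoff at 10 seconds
    }
  }
  result.get
}
{code}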
[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-24630: -- Attachment: (was: image-2018-07-30-18-48-38-352.png) > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.2.0, 2.2.1 > Reporter: Jackey Lee > Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a streaming SQL interface, with which users > with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. The analysis phase should be able to parse streaming SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
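As a rough illustration of the two key points in the description above, a streaming source can already be registered as a temporary view and queried through spark.sql with today's Structured Streaming APIs, which is essentially the "map metadata to the corresponding Relation" step the SPIP asks for. This is a minimal sketch, not the SPIP's implementation; the Kafka broker address and topic name are placeholder values taken from the comment below.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sqlstreaming-sketch").getOrCreate()

// Streaming source over Kafka (broker and topic are placeholders).
val serviceLogs = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "x.x.x.x:9092")
  .option("subscribe", "service_log")
  .load()

// Registering the streaming DataFrame as a temp view lets plain SQL text
// resolve to a streaming Relation, which is the crux of key point 2.
serviceLogs.createOrReplaceTempView("service_logs")

val counts = spark.sql(
  "SELECT key, value, count(*) AS cnt FROM service_logs GROUP BY key, value")
{code}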
[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-24630: -- Attachment: (was: image-2018-07-30-18-06-30-506.png) > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.2.0, 2.2.1 > Reporter: Jackey Lee > Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a streaming SQL interface, with which users > with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. The analysis phase should be able to parse streaming SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561762#comment-16561762 ] Genmao Yu edited comment on SPARK-24630 at 7/30/18 10:53 AM: - Try to add the StreamSQL DDL, like this:
{code:java}
spark-sql> create source stream service_logs with ("type"="kafka", "bootstrap.servers"="x.x.x.x:9092", "topic"="service_log");
Time taken: 1.833 seconds
spark-sql> desc service_logs;
key binary NULL
value binary NULL
topic string NULL
partition int NULL
offset bigint NULL
timestamp timestamp NULL
timestampType int NULL
Time taken: 0.394 seconds, Fetched 7 row(s)
spark-sql> create sink stream analysis with ("type"="kafka", "outputMode"="complete", "bootstrap.servers"="x.x.x.x:9092", "topic"="analysis", "checkpointLocation"="hdfs:///tmp/cp0");
Time taken: 0.027 seconds
spark-sql> insert into analysis select key, value, count(*) from service_logs group by key, value;
Time taken: 0.355 seconds
{code}
was (Author: unclegen): Try to add the StreamSQL DDL, like this: !image-2018-07-30-18-48-38-352.png! > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.2.0, 2.2.1 > Reporter: Jackey Lee > Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf, image-2018-07-30-18-06-30-506.png, image-2018-07-30-18-48-38-352.png > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a streaming SQL interface, with which users > with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. The analysis phase should be able to parse streaming SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
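For comparison with the proposed sink DDL and INSERT INTO in the comment above, the same pipeline expressed with today's DataStreamWriter API would look roughly like the following. This continues the counts streaming DataFrame from the earlier sketch; the broker, topic, and checkpoint location are the placeholder values from the comment, and the JSON encoding of the aggregate row is our own assumption since the comment does not specify one.
{code:scala}
import org.apache.spark.sql.streaming.StreamingQuery

// `counts` is the streaming aggregate DataFrame built in the earlier sketch.
// Kafka sinks expect string/binary key and value columns, so serialize the
// aggregate row as JSON before writing (an illustrative choice, not from the SPIP).
val query: StreamingQuery = counts
  .selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "x.x.x.x:9092")
  .option("topic", "analysis")
  .option("checkpointLocation", "hdfs:///tmp/cp0")
  .outputMode("complete") // matches the "outputMode"="complete" sink property above
  .start()
{code}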
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561762#comment-16561762 ] Genmao Yu commented on SPARK-24630: --- Try to add the StreamSQL DDL, like this: !image-2018-07-30-18-48-38-352.png! > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.2.0, 2.2.1 > Reporter: Jackey Lee > Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf, image-2018-07-30-18-06-30-506.png, image-2018-07-30-18-48-38-352.png > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a streaming SQL interface, with which users > with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. The analysis phase should be able to parse streaming SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Genmao Yu updated SPARK-24630: -- Attachment: image-2018-07-30-18-48-38-352.png > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.2.0, 2.2.1 > Reporter: Jackey Lee > Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf, image-2018-07-30-18-06-30-506.png, image-2018-07-30-18-48-38-352.png > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a streaming SQL interface, with which users > with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1. The analysis phase should be able to parse streaming SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org