[jira] [Assigned] (SPARK-40144) Standalone log-view can't load new
[ https://issues.apache.org/jira/browse/SPARK-40144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40144: Assignee: Apache Spark > Standalone log-view can't load new > --- > > Key: SPARK-40144 > URL: https://issues.apache.org/jira/browse/SPARK-40144 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Obobj >Assignee: Apache Spark >Priority: Minor > > log-view.js load new needs to call getBaseURI() of the utils.js file, but > does not reference utils.js -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40144) Standalone log-view can't load new
[ https://issues.apache.org/jira/browse/SPARK-40144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40144: Assignee: (was: Apache Spark) > Standalone log-view can't load new > --- > > Key: SPARK-40144 > URL: https://issues.apache.org/jira/browse/SPARK-40144 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Obobj >Priority: Minor > > log-view.js load new needs to call getBaseURI() of the utils.js file, but > does not reference utils.js -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40144) Standalone log-view can't load new
[ https://issues.apache.org/jira/browse/SPARK-40144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581664#comment-17581664 ] Apache Spark commented on SPARK-40144: -- User 'obobj' has created a pull request for this issue: https://github.com/apache/spark/pull/37577 > Standalone log-view can't load new > --- > > Key: SPARK-40144 > URL: https://issues.apache.org/jira/browse/SPARK-40144 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Obobj >Priority: Minor > > log-view.js load new needs to call getBaseURI() of the utils.js file, but > does not reference utils.js -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30115) Improve limit only query on datasource table
[ https://issues.apache.org/jira/browse/SPARK-30115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30115: Assignee: Apache Spark > Improve limit only query on datasource table > > > Key: SPARK-30115 > URL: https://issues.apache.org/jira/browse/SPARK-30115 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30115) Improve limit only query on datasource table
[ https://issues.apache.org/jira/browse/SPARK-30115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30115: Assignee: (was: Apache Spark) > Improve limit only query on datasource table > > > Key: SPARK-30115 > URL: https://issues.apache.org/jira/browse/SPARK-30115 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40144) Standalone log-view can't load new
Obobj created SPARK-40144:
------------------------------

             Summary: Standalone log-view can't load new
                 Key: SPARK-40144
                 URL: https://issues.apache.org/jira/browse/SPARK-40144
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, Web UI
    Affects Versions: 3.0.0
            Reporter: Obobj

The "load new" action in log-view.js needs to call getBaseURI() from utils.js, but the page does not reference utils.js.
[jira] [Commented] (SPARK-40005) Self-contained examples with parameter descriptions in PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-40005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581658#comment-17581658 ]

Hyukjin Kwon commented on SPARK-40005:
--------------------------------------

I removed ML since ML has its own dedicated docs: https://spark.apache.org/docs/latest/ml-guide.html

> Self-contained examples with parameter descriptions in PySpark documentation
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-40005
>                 URL: https://issues.apache.org/jira/browse/SPARK-40005
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Documentation, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Critical
>
> This JIRA aims to improve PySpark documentation in:
> - {{pyspark}}
> - {{pyspark.sql}}
> - {{pyspark.sql.streaming}}
> We should:
> - Make the examples self-contained, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
> - Document {{Parameters}}, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot. There are many APIs in PySpark that are missing parameter descriptions, e.g., [DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union]
> If a file is large, e.g., dataframe.py, we should split the work into subtasks and improve the documentation per file.
[jira] [Updated] (SPARK-40005) Self-contained examples with parameter descriptions in PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-40005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40005: - Description: This JIRA aims to improve PySpark documentation in: - {{pyspark}} - {{pyspark.sql}} - {{pyspark.sql.streaming}} We should: - Make the examples self-contained, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html - Document {{Parameters}} https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot. There are many API that misses parameters in PySpark, e.g., [DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union] If the size of file is large, e.g., dataframe.py, we should split that down into each subtask, and improve documentation. was: This JIRA aims to improve PySpark documentation in: - {{pyspark}} - {{pyspark.ml}} - {{pyspark.sql}} - {{pyspark.sql.streaming}} We should: - Make the examples self-contained, e.g., https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html - Document {{Parameters}} https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot. There are many API that misses parameters in PySpark, e.g., [DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union] If the size of file is large, e.g., dataframe.py, we should split that down into each subtask, and improve documentation. 
> Self-contained examples with parameter descriptions in PySpark documentation > > > Key: SPARK-40005 > URL: https://issues.apache.org/jira/browse/SPARK-40005 > Project: Spark > Issue Type: Umbrella > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Critical > > This JIRA aims to improve PySpark documentation in: > - {{pyspark}} > - {{pyspark.sql}} > - {{pyspark.sql.streaming}} > We should: > - Make the examples self-contained, e.g., > https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html > - Document {{Parameters}} > https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot. > There are many API that misses parameters in PySpark, e.g., > [DataFrame.union|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html#pyspark.sql.DataFrame.union] > If the size of file is large, e.g., dataframe.py, we should split that down > into each subtask, and improve documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
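The "self-contained examples" goal described above can be illustrated with a plain-Python stand-in (this is only a sketch of the documentation style; it is not PySpark code, and the `union` helper here is hypothetical): each docstring example constructs its own inputs and shows its expected output inline, so it runs as a doctest with no hidden setup.

```python
import doctest


def union(left, right):
    """Return the concatenation of two sequences, keeping duplicates
    (a stand-in for documenting an API such as DataFrame.union).

    Parameters
    ----------
    left : list
        First sequence of rows.
    right : list
        Second sequence of rows.

    Examples
    --------
    A self-contained example builds its own data inline:

    >>> union([1, 2], [2, 3])
    [1, 2, 2, 3]
    """
    return list(left) + list(right)


# Verify that the docstring examples actually run and match their output,
# which is exactly what self-contained examples make possible.
results = doctest.testmod()
```

Written this way, the example doubles as a regression test: CI can execute every documented snippet and fail if the shown output drifts from reality.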
[jira] [Resolved] (SPARK-35542) Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
[ https://issues.apache.org/jira/browse/SPARK-35542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu resolved SPARK-35542. Fix Version/s: 3.3.1 3.1.4 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 37568 [https://github.com/apache/spark/pull/37568] > Bucketizer created for multiple columns with parameters splitsArray, > inputCols and outputCols can not be loaded after saving it. > - > > Key: SPARK-35542 > URL: https://issues.apache.org/jira/browse/SPARK-35542 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.1 > Environment: {color:#172b4d}DataBricks Spark 3.1.1{color} >Reporter: Srikanth Pusarla >Assignee: Weichen Xu >Priority: Minor > Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0 > > Attachments: Code-error.PNG, traceback.png > > > Bucketizer created for multiple columns with parameters *splitsArray*, > *inputCols* and *outputCols* can not be loaded after saving it. > The problem is not seen for Bucketizer created for single column. 
> *Code to reproduce*
>
> {code:python}
> from pyspark.ml.feature import Bucketizer
>
> df = spark.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
> bucketizer = Bucketizer(splitsArray=[[-float("inf"), 0.5, 1.4, float("inf")],
>                                      [-float("inf"), 0.1, 1.2, float("inf")]],
>                         inputCols=["values", "values"], outputCols=["b1", "b2"])
> bucketed = bucketizer.transform(df).collect()
> dfb = bucketizer.transform(df)
> print(dfb.show())
>
> bucketizerPath = "dbfs:/mnt/S3-Bucket/" + "Bucketizer"
> bucketizer.write().overwrite().save(bucketizerPath)
> loadedBucketizer = Bucketizer.load(bucketizerPath)   # <- failing here
> loadedBucketizer.getSplits() == bucketizer.getSplits()
> {code}
>
> The error message is:
> {code}
> TypeError: array() argument 1 must be a unicode character, not bytes
> {code}
>
> *Backtrace:*
> {code}
> TypeError                                 Traceback (most recent call last)
>      15
>      16 bucketizer.write().overwrite().save(bucketizerPath)
> ---> 17 loadedBucketizer = Bucketizer.load(bucketizerPath)
>      18 loadedBucketizer.getSplits() == bucketizer.getSplits()
>
> /databricks/spark/python/pyspark/ml/util.py in load(cls, path)
>     376 def load(cls, path):
>     377     """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
> --> 378     return cls.read().load(path)
>
> /databricks/spark/python/pyspark/ml/util.py in load(self, path)
>     330     raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
>     331                               % self._clazz)
> --> 332     return self._clazz._from_java(java_obj)
>
> /databricks/spark/python/pyspark/ml/wrapper.py in _from_java(java_stage)
>     259     py_stage._resetUid(java_stage.uid())
> --> 260     py_stage._transfer_params_from_java()
>     261 elif hasattr(py_type, "_from_java"):
>     262     py_stage = py_type._from_java(java_stage)
>
> /databricks/spark/python/pyspark/ml/wrapper.py in _transfer_params_from_java(self)
>     186     # SPARK-14931: Only check set params back to avoid default params mismatch.
>     187     if self._java_obj.isSet(java_param):
> --> 188         value = _java2py(sc, self._java_obj.getOrDefault(java_param))
>     189         self._set(**{param.name: value})
>     190     # SPARK-10931: Temporary fix for params that have a default in Java
>
> /databricks/spark/python/pyspark/ml/common.py in _java2py(sc, r, encoding)
>     108     if isinstance(r, (bytearray, bytes)):
> --> 109         r = PickleSerializer().loads(bytes(r), encoding=encoding)
>     110     return r
>
> /databricks/spark/python/pyspark/serializers.py in loads(self, obj, encoding)
>     468 def loads(self, obj, encoding="bytes"):
> --> 469     return pickle.loads(obj, encoding=encoding)
>
> TypeError: array() argument 1 must be a unicode character, not bytes
> {code}
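The TypeError at the bottom of that backtrace originates in CPython's `array` constructor, which in Python 3 requires a `str` typecode and rejects `bytes`; when the Java-side splits arrays round-trip through the pickle path with a bytes typecode, loading fails. A minimal sketch of just that failure mode, independent of Spark:

```python
import array

# array() in Python 3 requires a str typecode such as "d" (double).
# A bytes typecode, as a bytes-mode unpickle can produce, raises
# TypeError instead of building the array.
try:
    array.array(b"d", [0.5, 1.4])      # bytes typecode: rejected
except TypeError as exc:
    print(type(exc).__name__)          # prints: TypeError

ok = array.array("d", [0.5, 1.4])      # str typecode: works
```

This is why the single-column Bucketizer (whose `splits` param is a plain list of floats) loads fine while the multi-column `splitsArray` variant trips over the typed-array round trip.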
[jira] [Resolved] (SPARK-40120) Make pyspark.sql.readwriter examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40120. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37570 [https://github.com/apache/spark/pull/37570] > Make pyspark.sql.readwriter examples self-contained > --- > > Key: SPARK-40120 > URL: https://issues.apache.org/jira/browse/SPARK-40120 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40120) Make pyspark.sql.readwriter examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40120: Assignee: Hyukjin Kwon > Make pyspark.sql.readwriter examples self-contained > --- > > Key: SPARK-40120 > URL: https://issues.apache.org/jira/browse/SPARK-40120 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40143) ANSI mode: allow explicitly casting fraction strings as Integral types
[ https://issues.apache.org/jira/browse/SPARK-40143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581635#comment-17581635 ] Apache Spark commented on SPARK-40143: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37576 > ANSI mode: allow explicitly casting fraction strings as Integral types > --- > > Key: SPARK-40143 > URL: https://issues.apache.org/jira/browse/SPARK-40143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > It's part of the ANSI SQL standard, and more consistent with the non-ansi > casting. We can have different behavior for implicit casting to avoid `'1.2' > = 1` returns true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40143) ANSI mode: allow explicitly casting fraction strings as Integral types
[ https://issues.apache.org/jira/browse/SPARK-40143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40143: Assignee: Gengliang Wang (was: Apache Spark) > ANSI mode: allow explicitly casting fraction strings as Integral types > --- > > Key: SPARK-40143 > URL: https://issues.apache.org/jira/browse/SPARK-40143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > It's part of the ANSI SQL standard, and more consistent with the non-ansi > casting. We can have different behavior for implicit casting to avoid `'1.2' > = 1` returns true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40143) ANSI mode: allow explicitly casting fraction strings as Integral types
[ https://issues.apache.org/jira/browse/SPARK-40143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40143: Assignee: Apache Spark (was: Gengliang Wang) > ANSI mode: allow explicitly casting fraction strings as Integral types > --- > > Key: SPARK-40143 > URL: https://issues.apache.org/jira/browse/SPARK-40143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > It's part of the ANSI SQL standard, and more consistent with the non-ansi > casting. We can have different behavior for implicit casting to avoid `'1.2' > = 1` returns true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37130) why spark-X.X.X-bin-without-hadoop.tgz does not provide spark-hive_X.jar (and spark-hive-thriftserver_X.jar)
[ https://issues.apache.org/jira/browse/SPARK-37130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581631#comment-17581631 ]

Victor Tso commented on SPARK-37130:
------------------------------------

I have the same question. We want Hadoop to be provided by the environment, but at a minimum spark-hive should be there, as the environment couldn't possibly provide that.

> why spark-X.X.X-bin-without-hadoop.tgz does not provide spark-hive_X.jar (and spark-hive-thriftserver_X.jar)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37130
>                 URL: https://issues.apache.org/jira/browse/SPARK-37130
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 3.1.2, 3.2.0
>            Reporter: Patrice DUROUX
>            Priority: Minor
>
> Hi,
> As my deployment has its own Hadoop(+Hive) installation, I tried to install Spark using the bundle without Hadoop. I suspect that some jars are missing that are present in the corresponding spark-X.X.X-bin-hadoop3.2.tgz. After comparing their contents, both spark-hive_2.12-X.X.X.jar and spark-hive-thriftserver_2.12-X.X.X.jar are absent from spark-X.X.X-bin-without-hadoop.tgz, and I don't know whether some others should also be there.
> Thanks,
> Patrice
[jira] [Updated] (SPARK-40143) ANSI mode: allow explicitly casting fraction strings as Integral types
[ https://issues.apache.org/jira/browse/SPARK-40143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-40143: --- Summary: ANSI mode: allow explicitly casting fraction strings as Integral types (was: ANSI mode: explicitly casting String as Integral types should allow fraction strings) > ANSI mode: allow explicitly casting fraction strings as Integral types > --- > > Key: SPARK-40143 > URL: https://issues.apache.org/jira/browse/SPARK-40143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > It's part of the ANSI SQL standard, and more consistent with the non-ansi > casting. We can have different behavior for implicit casting to avoid `'1.2' > = 1` returns true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40143) ANSI mode: explicitly casting String as Integral types should allow fraction strings
Gengliang Wang created SPARK-40143: -- Summary: ANSI mode: explicitly casting String as Integral types should allow fraction strings Key: SPARK-40143 URL: https://issues.apache.org/jira/browse/SPARK-40143 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang It's part of the ANSI SQL standard, and more consistent with the non-ansi casting. We can have different behavior for implicit casting to avoid `'1.2' = 1` returns true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
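The proposed semantics can be approximated in plain Python (an illustrative sketch only; Spark implements this inside the Cast expression in Scala, and the function name here is made up): an explicit cast accepts a fraction string such as '1.2' by parsing it as a decimal and truncating toward zero, while still rejecting non-numeric strings as ANSI mode requires.

```python
from decimal import Decimal, InvalidOperation


def explicit_cast_to_int(s: str) -> int:
    """Hypothetical model of ANSI-mode explicit CAST(string AS INT):
    accept fraction strings like '1.2' by truncating toward zero,
    and raise on malformed input (mirroring the ANSI cast error)."""
    try:
        # int() on a Decimal truncates toward zero, matching the
        # non-ANSI cast result for fraction strings.
        return int(Decimal(s.strip()))
    except InvalidOperation:
        raise ValueError(f"invalid input syntax for type int: {s!r}")
```

Implicit casting can stay strict under this split, so a comparison like `'1.2' = 1` does not silently coerce the string and return true.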
[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581628#comment-17581628 ] Apache Spark commented on SPARK-40142: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37575 > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40142: Assignee: (was: Apache Spark) > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40142: Assignee: Apache Spark > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40142) Make pyspark.sql.functions examples self-contained
Hyukjin Kwon created SPARK-40142: Summary: Make pyspark.sql.functions examples self-contained Key: SPARK-40142 URL: https://issues.apache.org/jira/browse/SPARK-40142 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39271) Upgrade pandas to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-39271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39271. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37567 [https://github.com/apache/spark/pull/37567] > Upgrade pandas to 1.4.3 > --- > > Key: SPARK-39271 > URL: https://issues.apache.org/jira/browse/SPARK-39271 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40081) Add Document Parameters for pyspark.sql.streaming.query
[ https://issues.apache.org/jira/browse/SPARK-40081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581606#comment-17581606 ] Hyukjin Kwon commented on SPARK-40081: -- [~dcoliversun] just asking. are you working on this? > Add Document Parameters for pyspark.sql.streaming.query > --- > > Key: SPARK-40081 > URL: https://issues.apache.org/jira/browse/SPARK-40081 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39271) Upgrade pandas to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-39271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39271: Assignee: Yikun Jiang > Upgrade pandas to 1.4.3 > --- > > Key: SPARK-39271 > URL: https://issues.apache.org/jira/browse/SPARK-39271 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37751) Apache Commons Crypto doesn't support Java 11
[ https://issues.apache.org/jira/browse/SPARK-37751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581582#comment-17581582 ] Qiyuan Gong commented on SPARK-37751: - Hi [~benoit_roy] . Some hot fix for this issue: # Change back to Java 8 if possible. # Use Kernel 5.4 or higher. We found this reduce the possibility of this error. > Apache Commons Crypto doesn't support Java 11 > - > > Key: SPARK-37751 > URL: https://issues.apache.org/jira/browse/SPARK-37751 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 3.1.2, 3.2.0 > Environment: Spark 3.2.0 on kubernetes >Reporter: Shipeng Feng >Priority: Major > > For kubernetes, we are using Java 11 in docker, > [https://github.com/apache/spark/blob/v3.2.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile:] > {code:java} > ARG java_image_tag=11-jre-slim > {code} > We have a simple app: > {code:scala} > object SimpleApp { > def main(args: Array[String]) { > val session = SparkSession.builder.getOrCreate > > // the size of demo.csv is 5GB > val rdd = session.read.option("header", "true").option("inferSchema", > "true").csv("/data/demo.csv").rdd > val lines = rdd.repartition(200) > val count = lines.count() > } > } > {code} > > Enable AES-based encryption for RPC connection by the following config: > {code:java} > --conf spark.authenticate=true > --conf spark.network.crypto.enabled=true > {code} > This would cause the following error: > {code:java} > java.lang.IllegalArgumentException: Frame length should be positive: > -6119185687804983867 > at > org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) > at > org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:150) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) > at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Unknown Source) {code} > The error disappears in 8-jre-slim. 
It seems that Apache Commons Crypto 1.1.0 > only works with Java 8: > [https://commons.apache.org/proper/commons-crypto/download_crypto.cgi] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
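The "Frame length should be positive" message above is a symptom rather than the root cause: the frame decoder reads an 8-byte big-endian length prefix from the (supposedly decrypted) stream, and when decryption goes wrong the prefix bytes are effectively random, so roughly half the time they decode to a negative long and trip the precondition seen in the stack trace. A minimal illustrative sketch of that failure mode (not Spark's actual `TransportFrameDecoder` code):

```java
import java.nio.ByteBuffer;

public class FrameLengthSketch {
    // Reads a big-endian 8-byte frame-length prefix, as a TCP frame
    // decoder typically would.
    static long readFrameLength(byte[] header) {
        return ByteBuffer.wrap(header).getLong();
    }

    public static void main(String[] args) {
        // Correctly decrypted header: a small positive frame length.
        byte[] good = {0, 0, 0, 0, 0, 0, 0, 42};
        System.out.println(readFrameLength(good)); // prints 42

        // Garbled header with the high bit set: decodes to a negative
        // long, which is exactly the condition the
        // Preconditions.checkArgument guard in the stack trace rejects.
        byte[] garbled = {(byte) 0x80, 0x15, 0x2F, 0x01, (byte) 0xDE, 0x33, 0x7A, 0x45};
        long len = readFrameLength(garbled);
        if (len <= 0) {
            System.out.println("Frame length should be positive: " + len);
        }
    }
}
```

This is why the error looks like a framing bug even though the underlying problem is the cipher layer producing garbage under Java 11.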
[jira] [Commented] (SPARK-40140) REST API for SQL level information does not show information on running queries
[ https://issues.apache.org/jira/browse/SPARK-40140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581580#comment-17581580 ] ming95 commented on SPARK-40140: I'm interested in this issue and I can fix it, if no one else is working on it. :) > REST API for SQL level information does not show information on running > queries > --- > > Key: SPARK-40140 > URL: https://issues.apache.org/jira/browse/SPARK-40140 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > Hi All, > We noticed that the SQL information REST API implemented in > https://issues.apache.org/jira/browse/SPARK-27142 does not return back SQL > queries which are currently running. We can only see queries which are > completed/failed. > As far as I can see, this should be supported since one of the fields in the > returned JSON is "runningJobIds". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
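The behavior described above is consistent with the endpoint filtering executions down to terminal states before serializing them. A hypothetical sketch of that pattern and the suggested fix (the enum and method names are illustrative only, not Spark's actual classes):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SqlStatusSketch {
    enum Status { RUNNING, COMPLETED, FAILED }

    // Buggy shape: serve only executions that reached a terminal state,
    // silently dropping anything still RUNNING.
    static List<Status> terminalOnly(List<Status> all) {
        return all.stream()
                  .filter(s -> s != Status.RUNNING)
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Status> all = Arrays.asList(Status.RUNNING, Status.COMPLETED, Status.FAILED);
        System.out.println(terminalOnly(all)); // prints [COMPLETED, FAILED]
        // The fix suggested by the report: return every execution, so
        // clients can see RUNNING queries (and their runningJobIds) too.
        System.out.println(all); // prints [RUNNING, COMPLETED, FAILED]
    }
}
```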
[jira] [Assigned] (SPARK-40141) Task listener overloads no longer needed with JDK 8+
[ https://issues.apache.org/jira/browse/SPARK-40141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40141: Assignee: (was: Apache Spark) > Task listener overloads no longer needed with JDK 8+ > > > Key: SPARK-40141 > URL: https://issues.apache.org/jira/browse/SPARK-40141 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Priority: Major > > TaskContext defines methods for registering completion and failure listeners, > and the respective listener types qualify as functional interfaces in JDK 8+. > This leads to awkward ambiguous overload errors with the overload of each > function, that takes a function directly instead of a listener. Now that JDK > 8 is the minimum allowed, we can remove the unnecessary overloads, which not > only simplifies the code, but also removes a source of frustration since it > can be nearly impossible to predict when an ambiguous overload might be > triggered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40141) Task listener overloads no longer needed with JDK 8+
[ https://issues.apache.org/jira/browse/SPARK-40141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581569#comment-17581569 ] Apache Spark commented on SPARK-40141: -- User 'ryan-johnson-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/37573 > Task listener overloads no longer needed with JDK 8+ > > > Key: SPARK-40141 > URL: https://issues.apache.org/jira/browse/SPARK-40141 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Priority: Major > > TaskContext defines methods for registering completion and failure listeners, > and the respective listener types qualify as functional interfaces in JDK 8+. > This leads to awkward ambiguous overload errors with the overload of each > function, that takes a function directly instead of a listener. Now that JDK > 8 is the minimum allowed, we can remove the unnecessary overloads, which not > only simplifies the code, but also removes a source of frustration since it > can be nearly impossible to predict when an ambiguous overload might be > triggered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40141) Task listener overloads no longer needed with JDK 8+
[ https://issues.apache.org/jira/browse/SPARK-40141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40141: Assignee: Apache Spark > Task listener overloads no longer needed with JDK 8+ > > > Key: SPARK-40141 > URL: https://issues.apache.org/jira/browse/SPARK-40141 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Assignee: Apache Spark >Priority: Major > > TaskContext defines methods for registering completion and failure listeners, > and the respective listener types qualify as functional interfaces in JDK 8+. > This leads to awkward ambiguous overload errors with the overload of each > function, that takes a function directly instead of a listener. Now that JDK > 8 is the minimum allowed, we can remove the unnecessary overloads, which not > only simplifies the code, but also removes a source of frustration since it > can be nearly impossible to predict when an ambiguous overload might be > triggered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36462: -- Affects Version/s: 3.4.0 (was: 3.2.0) (was: 3.3.0) > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.4.0 > > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36462: - Assignee: Holden Karau > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
[ https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36462. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36433 [https://github.com/apache/spark/pull/36433] > Allow Spark on Kube to operate without polling or watchers > -- > > Key: SPARK-36462 > URL: https://issues.apache.org/jira/browse/SPARK-36462 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.4.0 > > > Add an option to Spark on Kube to not track the individual executor pods and > just assume K8s is doing what it's asked. This would be a developer feature > intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40141) Task listener overloads no longer needed with JDK 8+
Ryan Johnson created SPARK-40141: Summary: Task listener overloads no longer needed with JDK 8+ Key: SPARK-40141 URL: https://issues.apache.org/jira/browse/SPARK-40141 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Ryan Johnson TaskContext defines methods for registering completion and failure listeners, and the respective listener types qualify as functional interfaces in JDK 8+. This leads to awkward ambiguous-overload errors between each listener-taking method and its overload that takes a function directly instead of a listener. Now that JDK 8 is the minimum allowed, we can remove the unnecessary overloads, which not only simplifies the code, but also removes a source of frustration since it can be nearly impossible to predict when an ambiguous overload might be triggered. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
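The ambiguity described above can be reproduced outside Spark with any pair of overloads whose parameters are shape-compatible functional interfaces: a bare lambda then matches both, and the compiler refuses to choose. A minimal Java sketch (the types are illustrative, not Spark's actual TaskContext API, which is Scala):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class OverloadSketch {
    interface TaskCompletionListener { void onTaskCompletion(String ctx); }

    static class Ctx {
        final List<TaskCompletionListener> listeners = new ArrayList<>();

        Ctx addTaskCompletionListener(TaskCompletionListener l) {
            listeners.add(l);
            return this;
        }

        // The convenience overload this ticket removes: because
        // TaskCompletionListener is itself a functional interface,
        // a bare lambda now matches both methods.
        Ctx addTaskCompletionListener(Consumer<String> f) {
            return addTaskCompletionListener(f::accept);
        }

        void complete(String ctx) {
            listeners.forEach(l -> l.onTaskCompletion(ctx));
        }
    }

    public static void main(String[] args) {
        Ctx ctx = new Ctx();
        // ctx.addTaskCompletionListener(c -> System.out.println(c));
        //   ^ does not compile: "reference to addTaskCompletionListener
        //     is ambiguous" -- callers must cast to disambiguate:
        ctx.addTaskCompletionListener((TaskCompletionListener) c ->
                System.out.println("completed: " + c));
        ctx.complete("task-0");
    }
}
```

Deleting the `Consumer` overload makes the commented-out call compile again, which is the simplification the ticket proposes.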
[jira] [Assigned] (SPARK-40106) Task failure handlers should always run if the task failed
[ https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-40106: -- Assignee: Ryan Johnson > Task failure handlers should always run if the task failed > -- > > Key: SPARK-40106 > URL: https://issues.apache.org/jira/browse/SPARK-40106 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Assignee: Ryan Johnson >Priority: Major > Fix For: 3.4.0 > > > Today, if a task body succeeds, but a task completion listener fails, task > failure listeners are not called -- even tho the task has indeed failed at > that point. > If a completion listener fails, and failure listeners were not previously > invoked, we should invoke them before running the remaining completion > listeners. > Such a change would increase the utility of task listeners, especially ones > intended to assist with task cleanup. > To give one arbitrary example, code like this appears at several places in > the code (taken from {{executeTask}} method of FileFormatWriter.scala): > {code:java} > try { > Utils.tryWithSafeFinallyAndFailureCallbacks(block = { > // Execute the task to write rows out and commit the task. > dataWriter.writeWithIterator(iterator) > dataWriter.commit() > })(catchBlock = { > // If there is an error, abort the task > dataWriter.abort() > logError(s"Job $jobId aborted.") > }, finallyBlock = { > dataWriter.close() > }) > } catch { > case e: FetchFailedException => > throw e > case f: FileAlreadyExistsException if > SQLConf.get.fastFailFileFormatOutput => > // If any output file to write already exists, it does not make sense > to re-run this task. > // We throw the exception and let Executor throw ExceptionFailure to > abort the job. 
> throw new TaskOutputFileAlreadyExistException(f) > case t: Throwable => > throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t) > }{code} > If failure listeners were reliably called, the above idiom could potentially > be factored out as two failure listeners plus a completion listener, and > reused rather than duplicated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
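The refactoring the last paragraph hints at could look roughly like the following sketch, assuming failure listeners are guaranteed to fire before completion listeners (all names here are illustrative, not Spark's actual API):

```java
import java.util.ArrayList;
import java.util.List;

public class TaskListenerSketch {
    interface FailureListener { void onTaskFailure(Throwable t); }
    interface CompletionListener { void onTaskCompletion(); }

    static class TaskCtx {
        final List<FailureListener> failureListeners = new ArrayList<>();
        final List<CompletionListener> completionListeners = new ArrayList<>();

        void addTaskFailureListener(FailureListener f) { failureListeners.add(f); }
        void addTaskCompletionListener(CompletionListener c) { completionListeners.add(c); }

        // Run the task body. On failure, failure listeners fire first
        // (e.g. dataWriter.abort()), then completion listeners fire
        // unconditionally (e.g. dataWriter.close()), mirroring the
        // catchBlock/finallyBlock pairing in the quoted idiom.
        void runTask(Runnable body) {
            try {
                body.run();
            } catch (Throwable t) {
                for (FailureListener f : failureListeners) f.onTaskFailure(t);
                throw t; // precise rethrow: body throws only unchecked
            } finally {
                for (CompletionListener c : completionListeners) c.onTaskCompletion();
            }
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        TaskCtx ctx = new TaskCtx();
        ctx.addTaskFailureListener(t -> log.add("abort"));
        ctx.addTaskCompletionListener(() -> log.add("close"));
        try {
            ctx.runTask(() -> { throw new RuntimeException("write failed"); });
        } catch (RuntimeException expected) {
            // the failure still propagates after listeners run
        }
        System.out.println(log); // prints [abort, close]
    }
}
```

With guaranteed ordering like this, the abort/close cleanup would be registered once and reused, rather than duplicated at each call site as the ticket describes.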
[jira] [Resolved] (SPARK-40106) Task failure handlers should always run if the task failed
[ https://issues.apache.org/jira/browse/SPARK-40106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-40106. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37531 [https://github.com/apache/spark/pull/37531] > Task failure handlers should always run if the task failed > -- > > Key: SPARK-40106 > URL: https://issues.apache.org/jira/browse/SPARK-40106 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Ryan Johnson >Priority: Major > Fix For: 3.4.0 > > > Today, if a task body succeeds, but a task completion listener fails, task > failure listeners are not called -- even tho the task has indeed failed at > that point. > If a completion listener fails, and failure listeners were not previously > invoked, we should invoke them before running the remaining completion > listeners. > Such a change would increase the utility of task listeners, especially ones > intended to assist with task cleanup. > To give one arbitrary example, code like this appears at several places in > the code (taken from {{executeTask}} method of FileFormatWriter.scala): > {code:java} > try { > Utils.tryWithSafeFinallyAndFailureCallbacks(block = { > // Execute the task to write rows out and commit the task. > dataWriter.writeWithIterator(iterator) > dataWriter.commit() > })(catchBlock = { > // If there is an error, abort the task > dataWriter.abort() > logError(s"Job $jobId aborted.") > }, finallyBlock = { > dataWriter.close() > }) > } catch { > case e: FetchFailedException => > throw e > case f: FileAlreadyExistsException if > SQLConf.get.fastFailFileFormatOutput => > // If any output file to write already exists, it does not make sense > to re-run this task. > // We throw the exception and let Executor throw ExceptionFailure to > abort the job. 
> throw new TaskOutputFileAlreadyExistException(f) > case t: Throwable => > throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t) > }{code} > If failure listeners were reliably called, the above idiom could potentially > be factored out as two failure listeners plus a completion listener, and > reused rather than duplicated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4: Assignee: (was: Apache Spark) > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-4 > URL: https://issues.apache.org/jira/browse/SPARK-4 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4: Assignee: Apache Spark > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-4 > URL: https://issues.apache.org/jira/browse/SPARK-4 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-4: --- Assignee: (was: Daniel) This is reverted via https://github.com/apache/spark/commit/50c163578cfef79002fbdbc54b3b8fc10cfbcf65 > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-4 > URL: https://issues.apache.org/jira/browse/SPARK-4 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-4: -- Fix Version/s: (was: 3.4.0) > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-4 > URL: https://issues.apache.org/jira/browse/SPARK-4 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37751) Apache Commons Crypto doesn't support Java 11
[ https://issues.apache.org/jira/browse/SPARK-37751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581506#comment-17581506 ] Benoit Roy edited comment on SPARK-37751 at 8/18/22 8:10 PM: - Hello, we have also encountered this issue after upgrading to Java11 (we also migrated to Spark 3.3.0 - standalone), so this also affects Spark 3.3.0 version. Any suggestions how we can resolve this? - aside from _spark.network.crypto.enabled_ ? was (Author: JIRAUSER293512): Hello, we have also encountered this issue after upgrading to Java11 (we also migrated to Spark 3.3.0 - standalone), so this appears also appears to affect Spark 3.3.0 version. Any suggestions how we can resolve this? - aside from _spark.network.crypto.enabled_ ? > Apache Commons Crypto doesn't support Java 11 > - > > Key: SPARK-37751 > URL: https://issues.apache.org/jira/browse/SPARK-37751 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 3.1.2, 3.2.0 > Environment: Spark 3.2.0 on kubernetes >Reporter: Shipeng Feng >Priority: Major > > For kubernetes, we are using Java 11 in docker, > [https://github.com/apache/spark/blob/v3.2.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile:] > {code:java} > ARG java_image_tag=11-jre-slim > {code} > We have a simple app: > {code:scala} > object SimpleApp { > def main(args: Array[String]) { > val session = SparkSession.builder.getOrCreate > > // the size of demo.csv is 5GB > val rdd = session.read.option("header", "true").option("inferSchema", > "true").csv("/data/demo.csv").rdd > val lines = rdd.repartition(200) > val count = lines.count() > } > } > {code} > > Enable AES-based encryption for RPC connection by the following config: > {code:java} > --conf spark.authenticate=true > --conf spark.network.crypto.enabled=true > {code} > This would cause the following error: > {code:java} > java.lang.IllegalArgumentException: Frame length should be positive: > -6119185687804983867 > at > 
org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) > at > org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:150) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Unknown Source) {code} > The error disappears in 8-jre-slim. It seems that Apache Commons Crypto 1.1.0 > only works with Java 8: > [https://commons.apache.org/proper/commons-crypto/download_crypto.cgi] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
[jira] [Updated] (SPARK-37544) sequence over dates with month interval is producing incorrect results
[ https://issues.apache.org/jira/browse/SPARK-37544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37544: -- Fix Version/s: 3.1.4 > sequence over dates with month interval is producing incorrect results > -- > > Key: SPARK-37544 > URL: https://issues.apache.org/jira/browse/SPARK-37544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 > Environment: Ubuntu 20, OSX 11.6 > OpenJDK 11, Spark 3.2 >Reporter: Vsevolod Ostapenko >Assignee: Bruce Robbins >Priority: Major > Labels: correctness > Fix For: 3.3.0, 3.1.4, 3.2.2 > > > The sequence function with dates and a step interval in months is producing unexpected > results. > Here is a sample using Spark 3.2 (though the behavior is the same in 3.1.1 > and presumably earlier): > {{scala> spark.sql("select sequence(date '2021-01-01', date '2022-01-01', > interval '3' month) x, date '2021-01-01' + interval '3' month y").collect()}} > {{res1: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01, > *2021-03-31, 2021-06-30, 2021-09-30,* > 2022-01-01),2021-04-01])}} > The expected result of adding 3 months to 2021-01-01 is 2021-04-01, while > sequence returns 2021-03-31. > At the same time sequence over timestamps works as expected: > {{scala> spark.sql("select sequence(timestamp '2021-01-01 00:00', timestamp > '2022-01-01 00:00', interval '3' month) x").collect()}} > {{res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01 > 00:00:00.0, *2021-04-01* 00:00:00.0, *2021-07-01* 00:00:00.0, *2021-10-01* > 00:00:00.0, 2022-01-01 00:00:00.0)])}} > > A similar issue was reported in the past - [SPARK-31654] sequence producing > inconsistent intervals for month step - ASF JIRA (apache.org) > It's marked resolved, but the problem has either resurfaced or was never > actually fixed. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
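The expected stepping in the report above matches plain java.time month arithmetic, shown here for reference; this sketch only demonstrates the expected values, not Spark's actual sequence implementation:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class MonthSequenceSketch {
    // Naive month-stepped sequence: start, start+3mo, start+6mo, ...
    static List<LocalDate> sequence(LocalDate start, LocalDate end, int stepMonths) {
        List<LocalDate> out = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusMonths(stepMonths)) {
            out.add(d);
        }
        return out;
    }

    public static void main(String[] args) {
        // Expected: 2021-01-01, 2021-04-01, 2021-07-01, 2021-10-01, 2022-01-01
        // (the reported date variant instead produced 2021-03-31,
        //  2021-06-30, 2021-09-30 for the middle steps)
        System.out.println(sequence(LocalDate.of(2021, 1, 1),
                                    LocalDate.of(2022, 1, 1), 3));
    }
}
```

Note that `LocalDate.plusMonths` of a month-start date always lands on a month start, so the end-of-month dates in the buggy output cannot come from straightforward month addition.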
[jira] [Updated] (SPARK-40134) Update ORC to 1.7.6
[ https://issues.apache.org/jira/browse/SPARK-40134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40134: -- Affects Version/s: 3.4.0 > Update ORC to 1.7.6 > --- > > Key: SPARK-40134 > URL: https://issues.apache.org/jira/browse/SPARK-40134 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0, 3.4.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40134) Update ORC to 1.7.6
[ https://issues.apache.org/jira/browse/SPARK-40134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40134: -- Affects Version/s: 3.3.0 > Update ORC to 1.7.6 > --- > > Key: SPARK-40134 > URL: https://issues.apache.org/jira/browse/SPARK-40134 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0, 3.4.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40134) Update ORC to 1.7.6
[ https://issues.apache.org/jira/browse/SPARK-40134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40134: -- Affects Version/s: (was: 3.4.0) > Update ORC to 1.7.6 > --- > > Key: SPARK-40134 > URL: https://issues.apache.org/jira/browse/SPARK-40134 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40134) Update ORC to 1.7.6
[ https://issues.apache.org/jira/browse/SPARK-40134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40134: -- Fix Version/s: 3.4.0 > Update ORC to 1.7.6 > --- > > Key: SPARK-40134 > URL: https://issues.apache.org/jira/browse/SPARK-40134 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0, 3.4.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40134) Update ORC to 1.7.6
[ https://issues.apache.org/jira/browse/SPARK-40134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40134: -- Fix Version/s: (was: 3.4.0) > Update ORC to 1.7.6 > --- > > Key: SPARK-40134 > URL: https://issues.apache.org/jira/browse/SPARK-40134 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40134) Update ORC to 1.7.6
[ https://issues.apache.org/jira/browse/SPARK-40134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40134. --- Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37563 [https://github.com/apache/spark/pull/37563] > Update ORC to 1.7.6 > --- > > Key: SPARK-40134 > URL: https://issues.apache.org/jira/browse/SPARK-40134 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: William Hyun >Priority: Major > Fix For: 3.3.1, 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40134) Update ORC to 1.7.6
[ https://issues.apache.org/jira/browse/SPARK-40134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40134: - Assignee: William Hyun > Update ORC to 1.7.6 > --- > > Key: SPARK-40134 > URL: https://issues.apache.org/jira/browse/SPARK-40134 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37751) Apache Commons Crypto doesn't support Java 11
[ https://issues.apache.org/jira/browse/SPARK-37751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581506#comment-17581506 ] Benoit Roy edited comment on SPARK-37751 at 8/18/22 7:54 PM: - Hello, we have also encountered this issue after upgrading to Java11 (we also migrated to Spark 3.3.0 - standalone), so this appears also appears to affect Spark 3.3.0 version. Any suggestions how we can resolve this? - aside from _spark.network.crypto.enabled_ ? was (Author: JIRAUSER293512): Hello, we have also encountered this issue after upgrading to Java11 (we also migrated to Spark 3.3.0), so this appears also appears to affect Spark 3.3.0 version. Any suggestions how we can resolve this? - aside from _spark.network.crypto.enabled_ ? > Apache Commons Crypto doesn't support Java 11 > - > > Key: SPARK-37751 > URL: https://issues.apache.org/jira/browse/SPARK-37751 > Project: Spark > Issue Type: Bug > Components: Security >Affects Versions: 3.1.2, 3.2.0 > Environment: Spark 3.2.0 on kubernetes >Reporter: Shipeng Feng >Priority: Major > > For kubernetes, we are using Java 11 in docker, > [https://github.com/apache/spark/blob/v3.2.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile:] > {code:java} > ARG java_image_tag=11-jre-slim > {code} > We have a simple app: > {code:scala} > object SimpleApp { > def main(args: Array[String]) { > val session = SparkSession.builder.getOrCreate > > // the size of demo.csv is 5GB > val rdd = session.read.option("header", "true").option("inferSchema", > "true").csv("/data/demo.csv").rdd > val lines = rdd.repartition(200) > val count = lines.count() > } > } > {code} > > Enable AES-based encryption for RPC connection by the following config: > {code:java} > --conf spark.authenticate=true > --conf spark.network.crypto.enabled=true > {code} > This would cause the following error: > {code:java} > java.lang.IllegalArgumentException: Frame length should be positive: > -6119185687804983867 > at > 
org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) > at > org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:150) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:190) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Unknown Source) {code} > The error disappears in 8-jre-slim. It seems that Apache Commons Crypto 1.1.0 > only works with Java 8: > [https://commons.apache.org/proper/commons-crypto/download_crypto.cgi] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail:
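The comment above asks for options other than turning off _spark.network.crypto.enabled_ altogether. One possibility, offered here only as a hedged sketch and not a confirmed fix, is to keep RPC authentication on but fall back to SASL-based encryption, which does not go through Commons Crypto's native code:

```
# Sketch of a possible workaround, assuming SASL-based encryption is
# acceptable for this deployment. spark.authenticate.enableSaslEncryption
# is the standard Spark config for SASL-based RPC encryption.
--conf spark.authenticate=true
--conf spark.network.crypto.enabled=false
--conf spark.authenticate.enableSaslEncryption=true
```

Whether this is acceptable depends on the deployment's security requirements: it trades the AES-based cipher for the older SASL encryption mechanism.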
[jira] [Commented] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581495#comment-17581495 ] Shrikant Prasad commented on SPARK-39399: - [~dongjoon] [~hyukjin.kwon] Can you please have a look at this issue and let me know if I need to add any more details in order to take this forward? > proxy-user not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355, proxy-user > support was added for Spark on K8s, but the PR only added the proxy-user argument > to the spark-submit command. The actual authentication using > the proxy user does not work in cluster deploy mode. > We get an AccessControlException when trying to access kerberized HDFS > through a proxy user. 
> Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \ > --conf spark.kubernetes.file.upload.path=hdfs:///tmp \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > + mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to 
constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private > org.apache.hadoop.metrics2.lib.MutableGaugeLong > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal > with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops",
[jira] [Commented] (SPARK-39993) Spark on Kubernetes doesn't filter data by date
[ https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581492#comment-17581492 ] Shrikant Prasad commented on SPARK-39993: - [~h.liashchuk] The code snippet you have shared is working fine in cluster deploy mode if we write to HDFS instead of S3, so I don't think there is any issue with the k8s master. You might also first check the output of df.show() to see whether the df contains the expected rows or not. > Spark on Kubernetes doesn't filter data by date > --- > > Key: SPARK-39993 > URL: https://issues.apache.org/jira/browse/SPARK-39993 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 > Environment: Kubernetes v1.23.6 > Spark 3.2.2 > Java 1.8.0_312 > Python 3.9.13 > Aws dependencies: > aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar >Reporter: Hanna Liashchuk >Priority: Major > Labels: kubernetes > > I'm creating a Dataset with a column of type date and saving it to S3. When I read it > back and use a where() clause, I've noticed it doesn't return data even > though the data is there. > Below is the code snippet I'm running: > > {code:python} > from pyspark.sql.types import Row > from pyspark.sql.functions import * > ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", > col("date").cast("date")) > ds.where("date = '2022-01-01'").show() > ds.write.mode("overwrite").parquet("s3a://bucket/test") > df = spark.read.format("parquet").load("s3a://bucket/test") > df.where("date = '2022-01-01'").show() > {code} > The first show() returns data, while the second one does not. > I've noticed that it's related to the Kubernetes master, as the same code snippet > works fine with master "local". > UPD: if the column is used as a partition and has the type "date", there is no > filtering problem. 
> > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
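To help narrow down where the rows disappear, one diagnostic, a sketch only, assuming a live PySpark session named `spark` on the Kubernetes master and assuming (unconfirmed) that parquet predicate pushdown is involved, is to re-run the failing read with pushdown disabled and compare:

```
# Diagnostic sketch (assumes an existing SparkSession `spark`; the bucket path
# is the one from the report). If the rows come back with pushdown disabled,
# the problem is likely in the parquet filter pushdown path rather than in
# Kubernetes scheduling itself.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
df = spark.read.format("parquet").load("s3a://bucket/test")
df.where("date = '2022-01-01'").show()
```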
[jira] [Resolved] (SPARK-40136) Incorrect fragment of query context
[ https://issues.apache.org/jira/browse/SPARK-40136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40136. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37566 [https://github.com/apache/spark/pull/37566] > Incorrect fragment of query context > --- > > Key: SPARK-40136 > URL: https://issues.apache.org/jira/browse/SPARK-40136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > The query context contains just a part of the fragment. The code below > demonstrates the issue: > {code:scala} > withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") { > val e = intercept[SparkArithmeticException] { > sql("select 1 / 0").collect() > } > println("'" + e.getQueryContext()(0).fragment() + "'") > } > '1 / ' > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40124. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37554 [https://github.com/apache/spark/pull/37554] > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Assignee: Kapil Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests
[ https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40124: - Assignee: Kapil Singh > Update TPCDS v1.4 q32 for Plan Stability tests > -- > > Key: SPARK-40124 > URL: https://issues.apache.org/jira/browse/SPARK-40124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kapil Singh >Assignee: Kapil Singh >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-40000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581440#comment-17581440 ] Apache Spark commented on SPARK-40000: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/37572 > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-40000 > URL: https://issues.apache.org/jira/browse/SPARK-40000 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27142) Provide REST API for SQL level information
[ https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581424#comment-17581424 ] Yeachan Park commented on SPARK-27142: -- Hi all, thanks a lot for working on this. I recently made a bug report regarding this story, https://issues.apache.org/jira/browse/SPARK-40140. Would you be able to take a look? Thanks! > Provide REST API for SQL level information > -- > > Key: SPARK-27142 > URL: https://issues.apache.org/jira/browse/SPARK-27142 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Minor > Fix For: 3.1.0 > > Attachments: image-2019-03-13-19-29-26-896.png > > > Currently, SQL information for monitoring a Spark application is not available > from the REST API but only via the UI. The REST API provides only > applications, jobs, stages, and environment. This Jira is targeted at providing a REST > API so that SQL-level information can be found. > > Details: > https://issues.apache.org/jira/browse/SPARK-27142?focusedCommentId=16791728=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16791728 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40140) REST API for SQL level information does not show information on running queries
Yeachan Park created SPARK-40140: Summary: REST API for SQL level information does not show information on running queries Key: SPARK-40140 URL: https://issues.apache.org/jira/browse/SPARK-40140 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Yeachan Park Hi All, We noticed that the SQL information REST API implemented in https://issues.apache.org/jira/browse/SPARK-27142 does not return SQL queries that are currently running; we can only see queries that have completed or failed. As far as I can see, this should be supported, since one of the fields in the returned JSON is "runningJobIds". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
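For context, the endpoint being discussed is the per-application SQL resource added by SPARK-27142. A quick way to inspect what it actually returns, sketched here with a hypothetical driver host and application id, is to query it on a live driver UI:

```
# The host and application id below are placeholders, not values from the
# report. The /sql resource lists SQL executions along with fields such as
# status and runningJobIds.
curl -s "http://driver-host:4040/api/v1/applications/app-20220818120000-0000/sql"
```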
[jira] [Updated] (SPARK-40139) Filter jobs by Job Group in UI & API
[ https://issues.apache.org/jira/browse/SPARK-40139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yeachan Park updated SPARK-40139: - Description: Hi all, We have some cases where it would be useful to see a list of all jobs belonging to the same job group in the Spark UI. This doesn't yet seem possible. We also noticed that the REST API provided by Spark does not allow filtering jobs by specific job groups. It would be great to have the ability to filter jobs by their job group id in both the UI and the API. was: Hi all, We have some cases where it would be useful to see a list of all jobs belonging to the same job group in the Spark UI. This doesn't yet seem possible. We also noticed that the REST API provided by Spark does not allow filtering jobs by specific job groups. It would be great to have the ability to filter jobs by their job group id in both the UI and the API. > Filter jobs by Job Group in UI & API > > > Key: SPARK-40139 > URL: https://issues.apache.org/jira/browse/SPARK-40139 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > Hi all, > We have some cases where it would be useful to see a list of all > jobs belonging to the same job group in the Spark UI. This doesn't yet seem > possible. We also noticed that the REST API provided by Spark does not allow > filtering jobs by specific job groups. > It would be great to have the ability to filter jobs by > their job group id in both the UI and the API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40139) Filter jobs by Job Group in UI & API
[ https://issues.apache.org/jira/browse/SPARK-40139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yeachan Park updated SPARK-40139: - Description: Hi all, We have some cases where it would be useful to be able to have a list of all jobs belonging to the same job group in the Spark UI. This doesn't yet seem possible. We also noticed that in the REST API provided by spark, it's not possible to filter in jobs by specific job groups. It would be great if we can have the functionality to filter on jobs based on their job group id in the UI and in the API. was: Hi all, We have some cases where it would be useful to be able to have a list of all jobs belonging to the same job group in the Spark UI. This doesn't yet seem possible. We also noticed that in the REST API provided by spark, it's not possible to filter in jobs by specific job groups. It would be great if we can have the functionality to filter on jobs based on their job group id in the UI and in the API. Thank you > Filter jobs by Job Group in UI & API > > > Key: SPARK-40139 > URL: https://issues.apache.org/jira/browse/SPARK-40139 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > Hi all, > We have some cases where it would be useful to be able to have a list of all > jobs belonging to the same job group in the Spark UI. This doesn't yet seem > possible. We also noticed that in the REST API provided by spark, it's not > possible to filter in jobs by specific job groups. > It would be great if we can have the functionality to filter on jobs based on > their job group id in the UI and in the API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40139) Filter jobs by Job Group in UI & API
[ https://issues.apache.org/jira/browse/SPARK-40139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yeachan Park updated SPARK-40139: - Description: Hi all, We have some cases where it would be useful to be able to have a list of all jobs belonging to the same job group in the Spark UI. This doesn't yet seem possible. We also noticed that in the REST API provided by spark, it's not possible to filter in jobs by specific job groups. It would be great if we can have the functionality to filter on jobs based on their job group id in the UI and in the API. Thank you, Yeachan was: Hi all, We have some cases where it would be useful to be able to have a list of all jobs belonging to the same job group in the Spark UI. This doesn't yet seem possible. We also noticed that in the REST API provided by spark, it's not possible to filter on specific job groups. It would be great if we can have the functionality to filter on jobs based on their job group id in the UI and in the API. Thank you, Yeachan > Filter jobs by Job Group in UI & API > > > Key: SPARK-40139 > URL: https://issues.apache.org/jira/browse/SPARK-40139 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > Hi all, > We have some cases where it would be useful to be able to have a list of all > jobs belonging to the same job group in the Spark UI. This doesn't yet seem > possible. We also noticed that in the REST API provided by spark, it's not > possible to filter in jobs by specific job groups. > It would be great if we can have the functionality to filter on jobs based on > their job group id in the UI and in the API. > Thank you, > Yeachan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40139) Filter jobs by Job Group in UI & API
Yeachan Park created SPARK-40139: Summary: Filter jobs by Job Group in UI & API Key: SPARK-40139 URL: https://issues.apache.org/jira/browse/SPARK-40139 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.2.0 Reporter: Yeachan Park Hi all, We have some cases where it would be useful to be able to have a list of all jobs belonging to the same job group in the Spark UI. This doesn't yet seem possible. We also noticed that in the REST API provided by spark, it's not possible to filter on specific job groups. It would be great if we can have the functionality to filter on jobs based on their job group id in the UI and in the API. Thank you, Yeachan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
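Until such a filter exists in the UI or API, the same effect can be had client-side: job groups are assigned via SparkContext.setJobGroup, and the monitoring REST API's jobs endpoint exposes a jobGroup field per job. A minimal sketch, with the HTTP call replaced by a canned payload shaped like a /api/v1/applications/<app-id>/jobs response so it runs standalone (the helper name is ours, not a Spark API):

```python
def filter_jobs_by_group(jobs, group_id):
    """Keep only the jobs whose jobGroup matches group_id."""
    return [j for j in jobs if j.get("jobGroup") == group_id]

# Canned stand-in for the JSON list returned by
# /api/v1/applications/<app-id>/jobs (fields abbreviated).
jobs = [
    {"jobId": 0, "name": "count at etl.py:10", "jobGroup": "etl"},
    {"jobId": 1, "name": "collect at repl:1", "jobGroup": "adhoc"},
    {"jobId": 2, "name": "save at etl.py:42", "jobGroup": "etl"},
]

etl_jobs = filter_jobs_by_group(jobs, "etl")
print([j["jobId"] for j in etl_jobs])  # → [0, 2]
```

In a real client the `jobs` list would come from `requests.get(...)` against the application's API endpoint; jobs started outside any group simply lack a matching `jobGroup` and fall out of the filter.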
[jira] [Assigned] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB
[ https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-38909: --- Assignee: Yang Jie > Encapsulate LevelDB used by ExternalShuffleBlockResolver and > YarnShuffleService as LocalDB > -- > > Key: SPARK-38909 > URL: https://issues.apache.org/jira/browse/SPARK-38909 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} > directly, which is not conducive to extending the use of {{RocksDB}} in > this scenario. This PR encapsulates that access for extensibility. It will be the > pre-work of SPARK-3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB
[ https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-38909. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36200 [https://github.com/apache/spark/pull/36200] > Encapsulate LevelDB used by ExternalShuffleBlockResolver and > YarnShuffleService as LocalDB > -- > > Key: SPARK-38909 > URL: https://issues.apache.org/jira/browse/SPARK-38909 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} > directly, which is not conducive to extending the use of {{RocksDB}} in > this scenario. This PR encapsulates that access for extensibility. It will be the > pre-work of SPARK-3 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39975) Upgrade rocksdbjni to 7.4.5
[ https://issues.apache.org/jira/browse/SPARK-39975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39975. -- Fix Version/s: 3.4.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37543 > Upgrade rocksdbjni to 7.4.5 > --- > > Key: SPARK-39975 > URL: https://issues.apache.org/jira/browse/SPARK-39975 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > > [https://github.com/facebook/rocksdb/releases/tag/v7.4.5] > > {code:java} > Fix a bug starting in 7.4.0 in which some fsync operations might be skipped > in a DB after any DropColumnFamily on that DB, until it is re-opened. This > can lead to data loss on power loss. (For custom FileSystem implementations, > this could lead to FSDirectory::Fsync or FSDirectory::Close after the first > FSDirectory::Close; Also, valgrind could report call to close() with fd=-1.) > {code} > > Fixed a bug that caused data loss > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"
[ https://issues.apache.org/jira/browse/SPARK-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581363#comment-17581363 ] Apache Spark commented on SPARK-21487: -- User 'santosh-d3vpl3x' has created a pull request for this issue: https://github.com/apache/spark/pull/37571 > WebUI-Executors Page results in "Request is a replay (34) attack" > - > > Key: SPARK-21487 > URL: https://issues.apache.org/jira/browse/SPARK-21487 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: ShuMing Li >Priority: Minor > > We recently upgraded Spark from 2.0.2 to 2.1.1, and the Web UI `Executors > Page` became empty, with the exception below. > The `Executors` page is rendered with JavaScript rather than Scala in > 2.1.1, but I don't know why this causes the problem. > Perhaps "two queries submitted at the same time with the same timestamp" may > cause this, but I'm not sure. > ResourceManager log: > {code:java} > 2017-07-20 20:39:09,371 WARN > org.apache.hadoop.security.authentication.server.AuthenticationFilter: > Authentication exception: GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > {code} > Safari browser console: > {code:java} > Failed to load resource: the server responded with a status of 403 > (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request > is a replay > (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html > {code} > Related Links: > https://issues.apache.org/jira/browse/HIVE-12481 > https://issues.apache.org/jira/browse/HADOOP-8830 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21487) WebUI-Executors Page results in "Request is a replay (34) attack"
[ https://issues.apache.org/jira/browse/SPARK-21487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581351#comment-17581351 ] Santosh Pingale commented on SPARK-21487: - I believe this is still an issue with kerberized Hadoop clusters, and sometimes a major one when you have to debug something. The issue is present in all the secured Hadoop clusters I have worked with. Finally, at my current org, I managed to get it working by patching Spark internally. *What's happening:* At some point, the Spark UI started sending Mustache templates over AJAX calls for client-side rendering to improve performance. Those template files have a `.html` extension, so YARN applies the authentication filter twice! *What could be done:* While YARN is causing the issue here, Spark can also make itself agnostic to it and let users use the Spark UI in its true form. The Mustache templates should really have a `.mustache` extension instead of `.html`; this change alone allows the templates to render properly. I have tested it to work locally and on a cluster. I can raise a PR and maybe we can discuss this over there. > WebUI-Executors Page results in "Request is a replay (34) attack" > - > > Key: SPARK-21487 > URL: https://issues.apache.org/jira/browse/SPARK-21487 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: ShuMing Li >Priority: Minor > > We recently upgraded Spark from 2.0.2 to 2.1.1, and the Web UI `Executors > Page` became empty, with the exception below. > The `Executors` page is rendered with JavaScript rather than Scala in > 2.1.1, but I don't know why this causes the problem. > Perhaps "two queries submitted at the same time with the same timestamp" may > cause this, but I'm not sure.
> ResourceManager log: > {code:java} > 2017-07-20 20:39:09,371 WARN > org.apache.hadoop.security.authentication.server.AuthenticationFilter: > Authentication exception: GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > {code} > Safari browser console: > {code:java} > Failed to load resource: the server responded with a status of 403 > (GSSException: Failure unspecified at GSS-API level (Mechanism level: Request > is a replay > (34)))http://hadoop-rm-host:8088/proxy/application_1494564992156_2751285/static/executorspage-template.html > {code} > Related Links: > https://issues.apache.org/jira/browse/HIVE-12481 > https://issues.apache.org/jira/browse/HADOOP-8830 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
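The rename Santosh describes can be sketched as a one-off script over the UI's static assets. Only `executorspage-template.html` is confirmed by the logs above; the second file name and the directory below are illustrative stand-ins, not Spark's actual resource layout:

```python
import os
import tempfile

# Sketch of the proposed fix: rename Mustache templates from .html to
# .mustache so YARN's auth filter no longer intercepts them as pages.
# A temp dir stands in for Spark's ui/static resource directory.
static_dir = tempfile.mkdtemp()
for name in ("executorspage-template.html", "stagepage-template.html"):
    open(os.path.join(static_dir, name), "w").close()

for fname in os.listdir(static_dir):
    if fname.endswith("-template.html"):
        base = fname[: -len(".html")]
        os.rename(os.path.join(static_dir, fname),
                  os.path.join(static_dir, base + ".mustache"))

print(sorted(os.listdir(static_dir)))
# → ['executorspage-template.mustache', 'stagepage-template.mustache']
```

In the real patch the JavaScript that fetches these templates would need the matching URL change; the extension swap alone is what keeps the AJAX request from looking like an HTML page request to the filter.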
[jira] [Resolved] (SPARK-40087) Support multiple Column drop in R
[ https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40087. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37526 [https://github.com/apache/spark/pull/37526] > Support multiple Column drop in R > - > > Key: SPARK-40087 > URL: https://issues.apache.org/jira/browse/SPARK-40087 > Project: Spark > Issue Type: New Feature > Components: R >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Assignee: Santosh Pingale >Priority: Minor > Fix For: 3.4.0 > > > This is a followup on SPARK-39895. The PR previously attempted to adjust > implementation for R as well to match signatures but that part was removed > and we only focused on getting python implementation to behave correctly. > *{{Change supports following operations:}}* > {{df <- select(read.json(jsonPath), "name", "age")}} > {{df$age2 <- df$age}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > {{df1 <- drop(df, df$age, column("random"))}} > {{expect_equal(columns(df1), c("name", "age2"))}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40087) Support multiple Column drop in R
[ https://issues.apache.org/jira/browse/SPARK-40087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40087: Assignee: Santosh Pingale > Support multiple Column drop in R > - > > Key: SPARK-40087 > URL: https://issues.apache.org/jira/browse/SPARK-40087 > Project: Spark > Issue Type: New Feature > Components: R >Affects Versions: 3.3.0 >Reporter: Santosh Pingale >Assignee: Santosh Pingale >Priority: Minor > > This is a followup on SPARK-39895. The PR previously attempted to adjust > implementation for R as well to match signatures but that part was removed > and we only focused on getting python implementation to behave correctly. > *{{Change supports following operations:}}* > {{df <- select(read.json(jsonPath), "name", "age")}} > {{df$age2 <- df$age}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > {{df1 <- drop(df, df$age, column("random"))}} > {{expect_equal(columns(df1), c("name", "age2"))}} > {{df1 <- drop(df, df$age, df$name)}} > {{expect_equal(columns(df1), c("age2"))}} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40120) Make pyspark.sql.readwriter examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40120: Assignee: Apache Spark > Make pyspark.sql.readwriter examples self-contained > --- > > Key: SPARK-40120 > URL: https://issues.apache.org/jira/browse/SPARK-40120 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40120) Make pyspark.sql.readwriter examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40120: Assignee: (was: Apache Spark) > Make pyspark.sql.readwriter examples self-contained > --- > > Key: SPARK-40120 > URL: https://issues.apache.org/jira/browse/SPARK-40120 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40120) Make pyspark.sql.readwriter examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581347#comment-17581347 ] Apache Spark commented on SPARK-40120: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37570 > Make pyspark.sql.readwriter examples self-contained > --- > > Key: SPARK-40120 > URL: https://issues.apache.org/jira/browse/SPARK-40120 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39993) Spark on Kubernetes doesn't filter data by date
[ https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanna Liashchuk updated SPARK-39993: Description: I'm creating a Dataset with a column of type date and saving it to S3. When I read it back and apply a where() clause, I've noticed it doesn't return data even though the data is there. Below is the code snippet I'm running {code:java} from pyspark.sql.types import Row from pyspark.sql.functions import * ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", col("date").cast("date")) ds.where("date = '2022-01-01'").show() ds.write.mode("overwrite").parquet("s3a://bucket/test") df = spark.read.format("parquet").load("s3a://bucket/test") df.where("date = '2022-01-01'").show() {code} The first show() returns data, while the second one does not. I've noticed that it's Kubernetes master related, as the same code snippet works fine with master "local" UPD: if the column is used as a partition and has the type "date" there is no filtering problem. was: I'm creating a Dataset with a column of type date and saving it to S3. When I read it back and apply a where() clause, I've noticed it doesn't return data even though the data is there. Below is the code snippet I'm running {code:java} from pyspark.sql.types import Row from pyspark.sql.functions import * ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", col("date").cast("date")) ds.where("date = '2022-01-01'").show() ds.write.mode("overwrite").parquet("s3a://bucket/test") df = spark.read.format("parquet").load("s3a://bucket/test") df.where("date = '2022-01-01'").show() {code} The first show() returns data, while the second one does not. I've noticed that it's Kubernetes master related, as the same code snippet works fine with master "local" UPD: if the column is used as a partition and has the type "date" or is de facto date but has the type "string", there is no filtering problem.
> Spark on Kubernetes doesn't filter data by date > --- > > Key: SPARK-39993 > URL: https://issues.apache.org/jira/browse/SPARK-39993 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 > Environment: Kubernetes v1.23.6 > Spark 3.2.2 > Java 1.8.0_312 > Python 3.9.13 > Aws dependencies: > aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar >Reporter: Hanna Liashchuk >Priority: Major > Labels: kubernetes > > I'm creating a Dataset with a column of type date and saving it to S3. When I read it > back and apply a where() clause, I've noticed it doesn't return data even > though the data is there. > Below is the code snippet I'm running > > {code:java} > from pyspark.sql.types import Row > from pyspark.sql.functions import * > ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date", > col("date").cast("date")) > ds.where("date = '2022-01-01'").show() > ds.write.mode("overwrite").parquet("s3a://bucket/test") > df = spark.read.format("parquet").load("s3a://bucket/test") > df.where("date = '2022-01-01'").show() > {code} > The first show() returns data, while the second one does not. > I've noticed that it's Kubernetes master related, as the same code snippet > works fine with master "local" > UPD: if the column is used as a partition and has the type "date" there is no > filtering problem. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40138: Assignee: Apache Spark > Implement DataFrame.mode > > > Key: SPARK-40138 > URL: https://issues.apache.org/jira/browse/SPARK-40138 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581315#comment-17581315 ] Apache Spark commented on SPARK-40138: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37569 > Implement DataFrame.mode > > > Key: SPARK-40138 > URL: https://issues.apache.org/jira/browse/SPARK-40138 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581316#comment-17581316 ] Apache Spark commented on SPARK-40138: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37569 > Implement DataFrame.mode > > > Key: SPARK-40138 > URL: https://issues.apache.org/jira/browse/SPARK-40138 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40138: Assignee: (was: Apache Spark) > Implement DataFrame.mode > > > Key: SPARK-40138 > URL: https://issues.apache.org/jira/browse/SPARK-40138 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40138) Implement DataFrame.mode
Ruifeng Zheng created SPARK-40138: - Summary: Implement DataFrame.mode Key: SPARK-40138 URL: https://issues.apache.org/jira/browse/SPARK-40138 Project: Spark Issue Type: Improvement Components: ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35542) Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
[ https://issues.apache.org/jira/browse/SPARK-35542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35542: Assignee: Weichen Xu (was: Apache Spark) > Bucketizer created for multiple columns with parameters splitsArray, > inputCols and outputCols can not be loaded after saving it. > - > > Key: SPARK-35542 > URL: https://issues.apache.org/jira/browse/SPARK-35542 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.1 > Environment: {color:#172b4d}DataBricks Spark 3.1.1{color} >Reporter: Srikanth Pusarla >Assignee: Weichen Xu >Priority: Minor > Attachments: Code-error.PNG, traceback.png > > > Bucketizer created for multiple columns with parameters *splitsArray*, > *inputCols* and *outputCols* can not be loaded after saving it. > The problem is not seen for Bucketizer created for single column. > *Code to reproduce* > ### > from pyspark.ml.feature import Bucketizer > df = spark.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"]) > bucketizer = Bucketizer(*splitsArray*= [[-float("inf"), 0.5, 1.4, > float("inf")], [-float("inf"), 0.1, 1.2, float("inf")]], > *inputCols*=["values", "values"], *outputCols*=["b1", "b2"]) > bucketed = bucketizer.transform(df).collect() > dfb = bucketizer.transform(df) > print(dfb.show()) > bucketizerPath = "dbfs:/mnt/S3-Bucket/" + "Bucketizer" > bucketizer.write().overwrite().save(bucketizerPath) > loadedBucketizer = {color:#FF}Bucketizer.load(bucketizerPath) > Failing here{color} > loadedBucketizer.getSplits() == bucketizer.getSplits() > > The error message is > {color:#FF}*TypeError: array() argument 1 must be a unicode character, > not bytes*{color} > > *BackTrace:* > > -- > TypeError Traceback (most recent call last) in 15 > 16 bucketizer.write().overwrite().save(bucketizerPath) > ---> 17 loadedBucketizer = Bucketizer.load(bucketizerPath) > 18 loadedBucketizer.getSplits() == bucketizer.getSplits() > > /databricks/spark/python/pyspark/ml/util.py in 
load(cls, path) > 376 def load(cls, path): > 377 """Reads an ML instance from the input path, a shortcut of > `read().load(path)`.""" > --> 378 return cls.read().load(path) > 379 > 380 > > /databricks/spark/python/pyspark/ml/util.py in load(self, path) > 330 raise NotImplementedError("This Java ML type cannot be loaded into Python > currently: %r" > 331 % self._clazz) > --> 332 return self._clazz._from_java(java_obj) > 333 > 334 > > def session(self, sparkSession): > /databricks/spark/python/pyspark/ml/wrapper.py in _from_java(java_stage) > 258 > 259 py_stage._resetUid(java_stage.uid()) > --> 260 py_stage._transfer_params_from_java() > 261 elif hasattr(py_type, "_from_java"): > 262 py_stage = py_type._from_java(java_stage) > > /databricks/spark/python/pyspark/ml/wrapper.py in > _transfer_params_from_java(self) > 186 # SPARK-14931: Only check set params back to avoid default params > mismatch. > 187 if self._java_obj.isSet(java_param): --> > 188 value = _java2py(sc, self._java_obj.getOrDefault(java_param)) > 189 self._set(**{param.name: value}) > 190 # SPARK-10931: Temporary fix for params that have a default in Java > > /databricks/spark/python/pyspark/ml/common.py in _java2py(sc, r, encoding) > 107 > 108 if isinstance(r, (bytearray, bytes)): > --> 109 r = PickleSerializer().loads(bytes(r), encoding=encoding) > 110 return r > 111 > > /databricks/spark/python/pyspark/serializers.py in loads(self, obj, encoding) > 467 > 468 def loads(self, obj, encoding="bytes"): > --> 469 return pickle.loads(obj, encoding=encoding) > 470 > 471 > > TypeError: array() argument 1 must be a unicode character, not bytes > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35542) Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
[ https://issues.apache.org/jira/browse/SPARK-35542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581272#comment-17581272 ] Apache Spark commented on SPARK-35542: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/37568

> Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
>
> Key: SPARK-35542
> URL: https://issues.apache.org/jira/browse/SPARK-35542
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.1
> Environment: Databricks Spark 3.1.1
> Reporter: Srikanth Pusarla
> Assignee: Weichen Xu
> Priority: Minor
> Attachments: Code-error.PNG, traceback.png
>
> A Bucketizer created for multiple columns with the parameters splitsArray, inputCols and outputCols cannot be loaded after saving it. The problem is not seen for a Bucketizer created for a single column.
>
> Code to reproduce:
>
> from pyspark.ml.feature import Bucketizer
> df = spark.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
> bucketizer = Bucketizer(splitsArray=[[-float("inf"), 0.5, 1.4, float("inf")], [-float("inf"), 0.1, 1.2, float("inf")]], inputCols=["values", "values"], outputCols=["b1", "b2"])
> bucketed = bucketizer.transform(df).collect()
> dfb = bucketizer.transform(df)
> print(dfb.show())
> bucketizerPath = "dbfs:/mnt/S3-Bucket/" + "Bucketizer"
> bucketizer.write().overwrite().save(bucketizerPath)
> loadedBucketizer = Bucketizer.load(bucketizerPath)  # Failing here
> loadedBucketizer.getSplits() == bucketizer.getSplits()
>
> The error message is: TypeError: array() argument 1 must be a unicode character, not bytes
>
> Backtrace:
>
> TypeError Traceback (most recent call last)
> in
>      15
>      16 bucketizer.write().overwrite().save(bucketizerPath)
> ---> 17 loadedBucketizer = Bucketizer.load(bucketizerPath)
>      18 loadedBucketizer.getSplits() == bucketizer.getSplits()
>
> /databricks/spark/python/pyspark/ml/util.py in load(cls, path)
>     376 def load(cls, path):
>     377     """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
> --> 378     return cls.read().load(path)
>
> /databricks/spark/python/pyspark/ml/util.py in load(self, path)
>     330     raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
>     331                               % self._clazz)
> --> 332     return self._clazz._from_java(java_obj)
>
> /databricks/spark/python/pyspark/ml/wrapper.py in _from_java(java_stage)
>     259     py_stage._resetUid(java_stage.uid())
> --> 260     py_stage._transfer_params_from_java()
>     261 elif hasattr(py_type, "_from_java"):
>     262     py_stage = py_type._from_java(java_stage)
>
> /databricks/spark/python/pyspark/ml/wrapper.py in _transfer_params_from_java(self)
>     186 # SPARK-14931: Only check set params back to avoid default params mismatch.
>     187 if self._java_obj.isSet(java_param):
> --> 188     value = _java2py(sc, self._java_obj.getOrDefault(java_param))
>     189     self._set(**{param.name: value})
>     190 # SPARK-10931: Temporary fix for params that have a default in Java
>
> /databricks/spark/python/pyspark/ml/common.py in _java2py(sc, r, encoding)
>     108 if isinstance(r, (bytearray, bytes)):
> --> 109     r = PickleSerializer().loads(bytes(r), encoding=encoding)
>     110 return r
>
> /databricks/spark/python/pyspark/serializers.py in loads(self, obj, encoding)
>     468 def loads(self, obj, encoding="bytes"):
> --> 469     return pickle.loads(obj, encoding=encoding)
>
> TypeError: array() argument 1 must be a unicode character, not bytes

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
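The TypeError above is characteristic of unpickling legacy `array.array` payloads on Python 3, where `array()` requires a unicode typecode and rejects the bytes typecode carried in the serialized data. A minimal sketch of the failure mode, deliberately independent of Spark (the Bucketizer serialization internals are not reproduced here):

```python
import array

# array() accepts a one-character unicode typecode on Python 3 ...
ok = array.array("d", [0.5, 1.4])
assert list(ok) == [0.5, 1.4]

# ... but rejects a bytes typecode, which is what a payload produced by an
# older serializer can hand it. This is the error surfaced by Bucketizer.load.
try:
    array.array(b"d", [0.5, 1.4])
except TypeError as exc:
    print(exc)  # e.g. "array() argument 1 must be a unicode character, not bytes"
```

This is why the traceback bottoms out in `pickle.loads`: the pickled parameter value reconstructs an `array` with a bytes typecode before PySpark ever sees it.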
[jira] [Assigned] (SPARK-35542) Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
[ https://issues.apache.org/jira/browse/SPARK-35542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35542: Assignee: Apache Spark (was: Weichen Xu)

> Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
>
> Key: SPARK-35542
> URL: https://issues.apache.org/jira/browse/SPARK-35542
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.1
> Environment: Databricks Spark 3.1.1
> Reporter: Srikanth Pusarla
> Assignee: Apache Spark
> Priority: Minor
> Attachments: Code-error.PNG, traceback.png
[jira] [Assigned] (SPARK-39271) Upgrade pandas to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-39271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39271: Assignee: Apache Spark

> Upgrade pandas to 1.4.3
>
> Key: SPARK-39271
> URL: https://issues.apache.org/jira/browse/SPARK-39271
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Yikun Jiang
> Assignee: Apache Spark
> Priority: Major
[jira] [Assigned] (SPARK-39271) Upgrade pandas to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-39271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39271: Assignee: (was: Apache Spark)

> Upgrade pandas to 1.4.3
>
> Key: SPARK-39271
> URL: https://issues.apache.org/jira/browse/SPARK-39271
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Yikun Jiang
> Priority: Major
[jira] [Commented] (SPARK-39271) Upgrade pandas to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-39271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581266#comment-17581266 ] Apache Spark commented on SPARK-39271: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37567

> Upgrade pandas to 1.4.3
>
> Key: SPARK-39271
> URL: https://issues.apache.org/jira/browse/SPARK-39271
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Yikun Jiang
> Priority: Major
[jira] [Updated] (SPARK-39271) Upgrade pandas to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-39271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-39271: Summary: Upgrade pandas to 1.4.3 (was: Upgrade pandas to 1.4.2)

> Upgrade pandas to 1.4.3
>
> Key: SPARK-39271
> URL: https://issues.apache.org/jira/browse/SPARK-39271
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Yikun Jiang
> Priority: Major
[jira] [Assigned] (SPARK-35542) Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
[ https://issues.apache.org/jira/browse/SPARK-35542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-35542: Assignee: Weichen Xu

> Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it.
>
> Key: SPARK-35542
> URL: https://issues.apache.org/jira/browse/SPARK-35542
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.1
> Environment: Databricks Spark 3.1.1
> Reporter: Srikanth Pusarla
> Assignee: Weichen Xu
> Priority: Minor
> Attachments: Code-error.PNG, traceback.png
[jira] [Commented] (SPARK-40136) Incorrect fragment of query context
[ https://issues.apache.org/jira/browse/SPARK-40136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581252#comment-17581252 ] Apache Spark commented on SPARK-40136: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37566

> Incorrect fragment of query context
>
> Key: SPARK-40136
> URL: https://issues.apache.org/jira/browse/SPARK-40136
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
>
> The query context contains only part of the fragment. The code below demonstrates the issue:
> {code:scala}
> withSQLConf(SQLConf.ANSI_ENABLED.key -> "true") {
>   val e = intercept[SparkArithmeticException] {
>     sql("select 1 / 0").collect()
>   }
>   println("'" + e.getQueryContext()(0).fragment() + "'")
> }
> '1 / '
> {code}
[jira] [Assigned] (SPARK-40136) Incorrect fragment of query context
[ https://issues.apache.org/jira/browse/SPARK-40136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40136: Assignee: Max Gekk (was: Apache Spark)

> Incorrect fragment of query context
>
> Key: SPARK-40136
> URL: https://issues.apache.org/jira/browse/SPARK-40136
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
[jira] [Assigned] (SPARK-40136) Incorrect fragment of query context
[ https://issues.apache.org/jira/browse/SPARK-40136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40136: Assignee: Apache Spark (was: Max Gekk)

> Incorrect fragment of query context
>
> Key: SPARK-40136
> URL: https://issues.apache.org/jira/browse/SPARK-40136
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Apache Spark
> Priority: Major
[jira] [Commented] (SPARK-40136) Incorrect fragment of query context
[ https://issues.apache.org/jira/browse/SPARK-40136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581251#comment-17581251 ] Apache Spark commented on SPARK-40136: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37566

> Incorrect fragment of query context
>
> Key: SPARK-40136
> URL: https://issues.apache.org/jira/browse/SPARK-40136
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
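The truncated fragment `'1 / '` is the shape of result an off-by-one in text offsets produces when slicing the original SQL string. The sketch below only illustrates that arithmetic; the index handling is hypothetical and not taken from the Spark source:

```python
sql = "select 1 / 0"

# The expression "1 / 0" occupies offsets 7..11 inclusive.
start = sql.index("1 / 0")                  # 7
stop_inclusive = start + len("1 / 0") - 1   # 11

# Slicing with the inclusive stop instead of stop + 1 drops the last
# character, yielding the truncated fragment reported in the ticket.
print(repr(sql[start:stop_inclusive]))      # '1 / '
print(repr(sql[start:stop_inclusive + 1]))  # '1 / 0'
```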
[jira] [Assigned] (SPARK-40137) Combines limits after projection
[ https://issues.apache.org/jira/browse/SPARK-40137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40137: Assignee: (was: Apache Spark)

> Combines limits after projection
>
> Key: SPARK-40137
> URL: https://issues.apache.org/jira/browse/SPARK-40137
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.2
> Reporter: Xianyang Liu
> Priority: Major
>
> `Dataset.show` adds an extra `Limit` and `Projection` on top of the given logical plan. If the `Dataset` is already a limit job, this introduces an extra shuffle phase, so we should combine the limit after the projection.
> For example:
> ```scala
> spark.sql("select * from spark.store_sales limit 10").show()
> ```
> Before:
> ```
> == Physical Plan ==
> AdaptiveSparkPlan (12)
> +- == Final Plan ==
>    * Project (7)
>    +- * GlobalLimit (6)
>       +- ShuffleQueryStage (5), Statistics(sizeInBytes=185.6 KiB, rowCount=990)
>          +- Exchange (4)
>             +- * LocalLimit (3)
>                +- * ColumnarToRow (2)
>                   +- Scan parquet spark_catalog.spark.store_sales (1)
> +- == Initial Plan ==
>    Project (11)
>    +- GlobalLimit (10)
>       +- Exchange (9)
>          +- LocalLimit (8)
>             +- Scan parquet spark_catalog.spark.store_sales (1)
> ```
> After:
> ```
> == Physical Plan ==
> CollectLimit (4)
> +- * Project (3)
>    +- * ColumnarToRow (2)
>       +- Scan parquet spark_catalog.spark.store_sales (1)
> ```
[jira] [Commented] (SPARK-40137) Combines limits after projection
[ https://issues.apache.org/jira/browse/SPARK-40137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581248#comment-17581248 ] Apache Spark commented on SPARK-40137: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/37565

> Combines limits after projection
>
> Key: SPARK-40137
> URL: https://issues.apache.org/jira/browse/SPARK-40137
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.2
> Reporter: Xianyang Liu
> Priority: Major
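The proposed rewrite rests on the observation that a projection preserves row count, so `Limit(n, Project(cols, Limit(m, child)))` can collapse to `Project(cols, Limit(min(n, m), child))`, letting a single `CollectLimit` replace the shuffle. A toy model of that rewrite, with illustrative plan classes that are not Spark's:

```python
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Project:
    cols: list
    child: object

@dataclass
class Limit:
    n: int
    child: object

def combine_limits(plan):
    """Rewrite Limit(n, Project(cols, Limit(m, c))) -> Project(cols, Limit(min(n, m), c))."""
    if (isinstance(plan, Limit)
            and isinstance(plan.child, Project)
            and isinstance(plan.child.child, Limit)):
        proj, inner = plan.child, plan.child.child
        return Project(proj.cols, Limit(min(plan.n, inner.n), inner.child))
    return plan

# show() wraps the user's "limit 10" query in another Limit + Project
# (21 rows here is an assumed wrapper limit, for illustration only):
before = Limit(21, Project(["*"], Limit(10, Scan("spark.store_sales"))))
after = combine_limits(before)
assert after == Project(["*"], Limit(10, Scan("spark.store_sales")))
```

Since the surviving limit sits directly above the scan, the planner no longer needs a GlobalLimit over an Exchange, which is exactly the Before/After difference shown in the ticket.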