[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751899#comment-17751899 ]

Hyukjin Kwon commented on SPARK-44717:
--------------------------------------

I think we should at least make this respect {{spark.sql.timestampType}} when it's set to {{TIMESTAMP_NTZ}}. Took a quick look, and I think there's a problem in the calculation logic in https://github.com/apache/spark/blob/master/python/pyspark/pandas/resample.py#L277-L310, especially that date_trunc always returns {{TimestampType}}. We might need a dedicated internal expression.

cc [~podongfeng]

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> ------------------------------------------------------------------------
>
>                 Key: SPARK-44717
>                 URL: https://issues.apache.org/jira/browse/SPARK-44717
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0, 3.4.1, 4.0.0
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Use one of the existing tests:
> - the "11H" case of test_dataframe_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> - the "1001H" case of test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> after setting the TZ, for example to New York, by using the following Python code in "setUpClass":
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> You will get the following error for the latter test:
> {noformat}
> ======================================================================
> FAIL [4.219s]: test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 276, in test_series_resample
>     self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 259, in _test_resample
>     self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in assert_eq
>     _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in _assert_pandas_almost_equal
>     raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces more resampled rows in the result. The DST change causes those extra rows, because the computed __tmp_resample_bin_col__ will be something like:
> {noformat}
> | __index_level_0__ | __tmp_resample_bin_col__ | A                  |
> |2011-03-08 00:00:00|2011-03-26 11:00:00       |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00       |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00       |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00       |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00       |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00       |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00       |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00       |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00       |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00       |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00       |0.9086816462089563  |
> {noformat}
> You can see the extra bin boundary around the DST change that kicked in on 2011-03-13 in New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.
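The pandas side of the mismatch described in the quoted table can be sketched with plain pandas (hypothetical daily data; a fixed-length `Timedelta` stands in for the "1001H" rule to avoid frequency-alias differences between pandas versions): pandas bins on the naive wall clock, so a DST transition never shifts rows between bins.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data spanning the 2011-03-13 US DST transition,
# mirroring the index in the quoted output.
idx = pd.date_range("2011-03-08", "2011-03-18", freq="D")  # tz-naive index
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# pandas resamples on the naive wall clock; the aggregate preserves the
# total regardless of where the bin edges land.
binned = s.resample(pd.Timedelta(hours=1001), closed="right", label="right").sum()
print(binned)
```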
> You can see my tests here: https://github.com/attilapiros/spark/pull/5
>
> Pandas timestamps are TZ-less:
> {noformat}
> >>> import pandas as pd
> >>> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> >>> b = pd.Timedelta(hours=1)
> >>> a
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select TIMESTAMP '2011-03-13 01:00:00'").show()
> +-------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +-------------------------------+
> |            2011-03-13 01:00:00|
> +-------------------------------+
> >>> sql("select TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
> +--------------------------------------------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> +--------------------------------------------------------------------+
> |                                                 2011-03-13 03:00:00|
> +--------------------------------------------------------------------+
> {noformat}
> The current resample code uses the above interval based calculation.
[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-44717:
---------------------------------------
[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751887#comment-17751887 ]

Hyukjin Kwon commented on SPARK-44717:
--------------------------------------

cc [~podongfeng]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-44717:
---------------------------------------
    Summary: "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help  (was: "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied)
[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied
[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751886#comment-17751886 ]

Attila Zsolt Piros commented on SPARK-44717:
--------------------------------------------

cc [~itholic] [~gurwls223]
[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied
[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-44717:
---------------------------------------
[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied
[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Zsolt Piros updated SPARK-44717:
---------------------------------------
[jira] [Created] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied
Attila Zsolt Piros created SPARK-44717: -- Summary: "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied Key: SPARK-44717 URL: https://issues.apache.org/jira/browse/SPARK-44717 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 3.4.1, 3.4.0, 4.0.0 Reporter: Attila Zsolt Piros Use one of the existing test: - "11H" case of test_dataframe_resample (pyspark.pandas.tests.test_resample.ResampleTests) -"1001H" case of test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests) After setting the TZ for example to New York. Like by using the following python code in a "setUpClass": {noformat} os.environ["TZ"] = 'America/New_York' {noformat} You will get the error for the latter one: {noformat} == FAIL [4.219s]: test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests) -- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 276, in test_series_resample self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum") File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 259, in _test_resample self.assert_eq( File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in assert_eq _assert_pandas_almost_equal(lobj, robj) File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in _assert_pandas_almost_equal raise PySparkAssertionError( pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal: Left: Freq: 1001H float64 Right: float64 {noformat} The problem is the in the pyspark resample there will be more resampled rows in the result. The DST change will cause those extra lines as the computed __tmp_resample_bin_col__ be something like: {noformat} | __index_level_0__.| __tmp_resample_bin_col__ | A . 
|2011-03-08 00:00:00|2011-03-26 11:00:00|0.3980551570183919  |
|2011-03-09 00:00:00|2011-03-26 11:00:00|0.6511376673995046  |
|2011-03-10 00:00:00|2011-03-26 11:00:00|0.6141085426890365  |
|2011-03-11 00:00:00|2011-03-26 11:00:00|0.11557638066163867 |
|2011-03-12 00:00:00|2011-03-26 11:00:00|0.4517788243490799  |
|2011-03-13 00:00:00|2011-03-26 11:00:00|0.8637060550157284  |
|2011-03-14 00:00:00|2011-03-26 10:00:00|0.8169499149450166  |
|2011-03-15 00:00:00|2011-03-26 10:00:00|0.4585916249356583  |
|2011-03-16 00:00:00|2011-03-26 10:00:00|0.8362472880832088  |
|2011-03-17 00:00:00|2011-03-26 10:00:00|0.026716901748386812|
|2011-03-18 00:00:00|2011-03-26 10:00:00|0.9086816462089563  |
{noformat}
You can see the extra bin values around when DST kicked in on 2011-03-13 in New York. Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help. You can see my tests here: https://github.com/attilapiros/spark/pull/5

Pandas timestamps are TZ-less:
{noformat}
>>> import pandas as pd
>>> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
>>> b = pd.Timedelta(hours=1)
>>> a
Timestamp('2011-03-13 01:00:00')
>>> a+b
Timestamp('2011-03-13 02:00:00')
>>> a+b+b
Timestamp('2011-03-13 03:00:00')
{noformat}
But pyspark TimestampType uses TZ and DST:
{noformat}
>>> sql("select TIMESTAMP '2011-03-13 01:00:00'").show()
+-------------------------------+
|TIMESTAMP '2011-03-13 01:00:00'|
+-------------------------------+
|            2011-03-13 01:00:00|
+-------------------------------+
>>> sql("select TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
+--------------------------------------------------------------------+
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
+--------------------------------------------------------------------+
|                                                 2011-03-13 03:00:00|
+--------------------------------------------------------------------+
{noformat}
The current resample code uses the above interval-based calculation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
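The pandas-vs-Spark contrast reported above can be reproduced with the standard library alone (no pandas or Spark needed). This sketch uses `zoneinfo` and assumes the `America/New_York` DST rules from the report; it is an illustration of the two arithmetic models, not Spark's actual code:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Wall-clock (tz-naive) arithmetic, which is what pandas' TZ-less Timestamp does:
naive = datetime(2011, 3, 13, 1, 0)          # DST starts at 02:00 local in New York
print(naive + timedelta(hours=1))            # 2011-03-13 02:00:00

# Instant-based arithmetic, which is how Spark's TimestampType interval math behaves:
ny = ZoneInfo("America/New_York")
aware = datetime(2011, 3, 13, 1, 0, tzinfo=ny)
one_hour_later = (aware.astimezone(ZoneInfo("UTC"))
                  + timedelta(hours=1)).astimezone(ny)
print(one_hour_later)                        # 2011-03-13 03:00:00-04:00 (02:00 never exists)
```

One physical hour after 01:00 EST lands on 03:00 EDT, which is exactly the jump the resample bin column shows above.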
[jira] [Deleted] (SPARK-44716) Add missing callUDF in Python Spark Connect client
[ https://issues.apache.org/jira/browse/SPARK-44716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon deleted SPARK-44716: - > Add missing callUDF in Python Spark Connect client > -- > > Key: SPARK-44716 > URL: https://issues.apache.org/jira/browse/SPARK-44716 > Project: Spark > Issue Type: Task >Reporter: Hyukjin Kwon >Priority: Major > > See also SPARK-44715 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44716) Add missing callUDF in Python Spark Connect client
Hyukjin Kwon created SPARK-44716: Summary: Add missing callUDF in Python Spark Connect client Key: SPARK-44716 URL: https://issues.apache.org/jira/browse/SPARK-44716 Project: Spark Issue Type: Task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Hyukjin Kwon See also SPARK-44715 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44703) Log eventLog rewrite duration when compacting old event log files
[ https://issues.apache.org/jira/browse/SPARK-44703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-44703. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42378 [https://github.com/apache/spark/pull/42378]
> Log eventLog rewrite duration when compacting old event log files
> --
>
> Key: SPARK-44703
> URL: https://issues.apache.org/jira/browse/SPARK-44703
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: shuyouZZ
> Assignee: shuyouZZ
> Priority: Major
> Fix For: 4.0.0
>
> When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog files exceeds the value of {{spark.history.fs.eventLog.rolling.maxFilesToRetain}}, HistoryServer will compact the old event log files into one compact file. Currently the {{rewrite}} method does not log the rewrite duration; this metric is useful for understanding the compaction duration, so we need to add logging to the method.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44703) Log eventLog rewrite duration when compacting old event log files
[ https://issues.apache.org/jira/browse/SPARK-44703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-44703: Assignee: shuyouZZ
> Log eventLog rewrite duration when compacting old event log files
> --
>
> Key: SPARK-44703
> URL: https://issues.apache.org/jira/browse/SPARK-44703
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: shuyouZZ
> Assignee: shuyouZZ
> Priority: Major
>
> When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog files exceeds the value of {{spark.history.fs.eventLog.rolling.maxFilesToRetain}}, HistoryServer will compact the old event log files into one compact file. Currently the {{rewrite}} method does not log the rewrite duration; this metric is useful for understanding the compaction duration, so we need to add logging to the method.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44703) Log eventLog rewrite duration when compacting old event log files
[ https://issues.apache.org/jira/browse/SPARK-44703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-44703: - Fix Version/s: (was: 3.5.0)
> Log eventLog rewrite duration when compacting old event log files
> --
>
> Key: SPARK-44703
> URL: https://issues.apache.org/jira/browse/SPARK-44703
> Project: Spark
> Issue Type: Task
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: shuyouZZ
> Priority: Major
>
> When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog files exceeds the value of {{spark.history.fs.eventLog.rolling.maxFilesToRetain}}, HistoryServer will compact the old event log files into one compact file. Currently the {{rewrite}} method does not log the rewrite duration; this metric is useful for understanding the compaction duration, so we need to add logging to the method.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
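The change requested in SPARK-44703 amounts to timing the rewrite step and logging the result. The actual implementation is Scala inside the HistoryServer; the following is only a minimal Python sketch of the pattern, with hypothetical names:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("EventLogFileCompactor")

def rewrite(event_log_files):
    # Hypothetical stand-in for the HistoryServer's compaction rewrite step.
    start = time.perf_counter()
    compacted = "".join(event_log_files)  # placeholder for the real file merge
    duration_ms = (time.perf_counter() - start) * 1000.0
    # The metric the ticket asks for: how long the rewrite took.
    log.info("Finished rewriting %d event log files in %.1f ms",
             len(event_log_files), duration_ms)
    return compacted

rewrite(["event-a\n", "event-b\n"])
```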
[jira] [Resolved] (SPARK-44005) Improve error messages for regular Python UDTFs that return non-tuple values
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44005. --- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42353 [https://github.com/apache/spark/pull/42353] > Improve error messages for regular Python UDTFs that return non-tuple values > > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44005) Improve error messages for regular Python UDTFs that return non-tuple values
[ https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44005: - Assignee: Allison Wang > Improve error messages for regular Python UDTFs that return non-tuple values > > > Key: SPARK-44005 > URL: https://issues.apache.org/jira/browse/SPARK-44005 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > > Currently, if you have a UDTF like this: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield a {code} > and run the UDTF, it will fail with a confusing error message like > {code:java} > Unexpected tuple 1 with StructType {code} > Note this works when arrow is enabled. We should improve error messages for > regular UDTFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
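A Spark-free sketch of the user-side contract behind the "Unexpected tuple 1 with StructType" error above: each value yielded by `eval` must be a row tuple (one element per output column), so the failing UDTF is fixed by wrapping the scalar. The class name comes from the report; the rest is illustrative:

```python
class TestUDTF:
    def eval(self, a: int):
        # Correct: yield a one-element tuple, not a bare int.
        # `yield a` is what triggers the confusing error in regular UDTFs.
        yield (a,)

rows = list(TestUDTF().eval(1))
print(rows)  # [(1,)]
```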
[jira] [Resolved] (SPARK-44689) `UDFClassLoadingE2ESuite` failed in the daily test for Java 17.
[ https://issues.apache.org/jira/browse/SPARK-44689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44689. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42360 [https://github.com/apache/spark/pull/42360] > `UDFClassLoadingE2ESuite` failed in the daily test for Java 17. > --- > > Key: SPARK-44689 > URL: https://issues.apache.org/jira/browse/SPARK-44689 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > [https://github.com/apache/spark/actions/runs/5766913899/job/15635782831] > > {code:java} > [info] - update class loader after stubbing: new session *** FAILED *** (101 > milliseconds) > 6233[info] "Exception in SerializedLambda.readResolve" did not contain > "java.lang.NoSuchMethodException: > org.apache.spark.sql.connect.client.StubClassDummyUdf" > (UDFClassLoadingE2ESuite.scala:57) > 6234[info] org.scalatest.exceptions.TestFailedException: > 6235[info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > 6236[info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > 6237[info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > 6238[info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > 6239[info] at > org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.$anonfun$new$1(UDFClassLoadingE2ESuite.scala:57) > 6240[info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 6241[info] at > org.apache.spark.sql.connect.client.util.RemoteSparkSession.$anonfun$test$1(RemoteSparkSession.scala:243) > 6242[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > 6243[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > 6244[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > 
6245[info] at org.scalatest.Transformer.apply(Transformer.scala:22) > 6246[info] at org.scalatest.Transformer.apply(Transformer.scala:20) > 6247[info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > 6248[info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > 6249[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > 6250[info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > 6251[info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > 6252[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > 6253[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > 6254[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > 6255[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > 6256[info] at > org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > 6257[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > 6258[info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > 6259[info] at scala.collection.immutable.List.foreach(List.scala:431) > 6260[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > 6261[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > 6262[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > 6263[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > 6264[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > 6265[info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > 6266[info] at org.scalatest.Suite.run(Suite.scala:1114) > 6267[info] at org.scalatest.Suite.run$(Suite.scala:1096) > 6268[info] at > 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > 6269[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > 6270[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > 6271[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > 6272[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > 6273[info] at > org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.org$scalatest$BeforeAndAfterAll$$super$run(UDFClassLoadingE2ESuite.scala:29) > 6274[info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > 6275[info] at >
[jira] [Assigned] (SPARK-44689) `UDFClassLoadingE2ESuite` failed in the daily test for Java 17.
[ https://issues.apache.org/jira/browse/SPARK-44689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44689: Assignee: Yang Jie > `UDFClassLoadingE2ESuite` failed in the daily test for Java 17. > --- > > Key: SPARK-44689 > URL: https://issues.apache.org/jira/browse/SPARK-44689 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > [https://github.com/apache/spark/actions/runs/5766913899/job/15635782831] > > {code:java} > [info] - update class loader after stubbing: new session *** FAILED *** (101 > milliseconds) > 6233[info] "Exception in SerializedLambda.readResolve" did not contain > "java.lang.NoSuchMethodException: > org.apache.spark.sql.connect.client.StubClassDummyUdf" > (UDFClassLoadingE2ESuite.scala:57) > 6234[info] org.scalatest.exceptions.TestFailedException: > 6235[info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > 6236[info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > 6237[info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > 6238[info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > 6239[info] at > org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.$anonfun$new$1(UDFClassLoadingE2ESuite.scala:57) > 6240[info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > 6241[info] at > org.apache.spark.sql.connect.client.util.RemoteSparkSession.$anonfun$test$1(RemoteSparkSession.scala:243) > 6242[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > 6243[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > 6244[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > 6245[info] at org.scalatest.Transformer.apply(Transformer.scala:22) > 6246[info] at org.scalatest.Transformer.apply(Transformer.scala:20) > 
6247[info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > 6248[info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > 6249[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > 6250[info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > 6251[info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > 6252[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > 6253[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > 6254[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > 6255[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > 6256[info] at > org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > 6257[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > 6258[info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > 6259[info] at scala.collection.immutable.List.foreach(List.scala:431) > 6260[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > 6261[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > 6262[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > 6263[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > 6264[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > 6265[info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > 6266[info] at org.scalatest.Suite.run(Suite.scala:1114) > 6267[info] at org.scalatest.Suite.run$(Suite.scala:1096) > 6268[info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > 6269[info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > 
6270[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > 6271[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > 6272[info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > 6273[info] at > org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.org$scalatest$BeforeAndAfterAll$$super$run(UDFClassLoadingE2ESuite.scala:29) > 6274[info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > 6275[info] at > org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > 6276[info] at > org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > 6277[info] a
[jira] [Resolved] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions
[ https://issues.apache.org/jira/browse/SPARK-44554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44554. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42167 [https://github.com/apache/spark/pull/42167] > Install different Python linter dependencies for daily testing of different > Spark versions > -- > > Key: SPARK-44554 > URL: https://issues.apache.org/jira/browse/SPARK-44554 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 4.0.0 > > > Fix daily test python lint check failure for branches 3.3 and 3.4 > > 3.4 : > [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266] > 3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions
[ https://issues.apache.org/jira/browse/SPARK-44554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44554: Assignee: Yang Jie > Install different Python linter dependencies for daily testing of different > Spark versions > -- > > Key: SPARK-44554 > URL: https://issues.apache.org/jira/browse/SPARK-44554 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > Fix daily test python lint check failure for branches 3.3 and 3.4 > > 3.4 : > [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266] > 3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet
[ https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-44641. -- Fix Version/s: 3.4.2 3.5.0 Assignee: Chao Sun Resolution: Fixed > SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but > conditions unmet > -- > > Key: SPARK-44641 > URL: https://issues.apache.org/jira/browse/SPARK-44641 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: Szehon Ho >Assignee: Chao Sun >Priority: Blocker > Fix For: 3.4.2, 3.5.0 > > > Adding the following test case in KeyGroupedPartitionSuite demonstrates the > problem. > > {code:java} > test("test join key is the second partition key and a transform") { > val items_partitions = Array(bucket(8, "id"), days("arrive_time")) > createTable(items, items_schema, items_partitions) > sql(s"INSERT INTO testcat.ns.$items VALUES " + > s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " + > s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " + > s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " + > s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " + > s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))") > val purchases_partitions = Array(bucket(8, "item_id"), days("time")) > createTable(purchases, purchases_schema, purchases_partitions) > sql(s"INSERT INTO testcat.ns.$purchases VALUES " + > s"(1, 42.0, cast('2020-01-01' as timestamp)), " + > s"(1, 44.0, cast('2020-01-15' as timestamp)), " + > s"(1, 45.0, cast('2020-01-15' as timestamp)), " + > s"(2, 11.0, cast('2020-01-01' as timestamp)), " + > s"(3, 19.5, cast('2020-02-01' as timestamp))") > withSQLConf( > SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false", > SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true", > SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key -> > "true") { > val df = sql("SELECT id, name, i.price as purchase_price, " + > "p.item_id, p.price as sale_price " + > s"FROM 
testcat.ns.$items i JOIN testcat.ns.$purchases p " +
>     "ON i.arrive_time = p.time " +
>     "ORDER BY id, purchase_price, p.item_id, sale_price")
>   val shuffles = collectShuffles(df.queryExecution.executedPlan)
>   assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys are partition keys")
>   checkAnswer(df,
>     Seq(
>       Row(1, "aa", 40.0, 1, 42.0),
>       Row(1, "aa", 40.0, 2, 11.0),
>       Row(1, "aa", 41.0, 1, 44.0),
>       Row(1, "aa", 41.0, 1, 45.0),
>       Row(2, "bb", 10.0, 1, 42.0),
>       Row(2, "bb", 10.0, 2, 11.0),
>       Row(2, "bb", 10.5, 1, 42.0),
>       Row(2, "bb", 10.5, 2, 11.0),
>       Row(3, "cc", 15.5, 3, 19.5)
>     )
>   )
> }
> }{code}
>
> Note: this test has set up the DataSource V2 to return multiple splits for the same partition.
> In this case, SPJ is not triggered (because the join key does not match the partition key), but the following code in DSV2Scan: [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194] intended to fill the empty partitions for 'pushdown-value' will still iterate through the non-grouped partitions and look up from the grouped partitions to fill the map, resulting in some duplicate input data being fed into the join.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
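The duplication mechanism described in the note can be sketched in plain Python. This is a deliberately simplified model, not Spark's actual code: iterating the ungrouped split list while looking rows up per merged partition feeds a multi-split partition's rows in more than once:

```python
# Two splits belong to the same partition "p1" (as the test's DataSource V2 is set up to return).
splits = [("p1", [1, 2]), ("p1", [3]), ("p2", [4])]

# Group rows by partition value, as the pushed-down-partition-values path does.
grouped = {}
for part, rows in splits:
    grouped.setdefault(part, []).extend(rows)

# Buggy shape: one lookup per *split* rather than per *partition*,
# so p1's merged rows are emitted twice.
fed = [row for part, _ in splits for row in grouped[part]]
print(fed)  # [1, 2, 3, 1, 2, 3, 4]
```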
[jira] [Updated] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly
[ https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-44683: - Fix Version/s: 3.5.1 (was: 3.5.0)
> Logging level isn't passed to RocksDB state store provider correctly
>
> Key: SPARK-44683
> URL: https://issues.apache.org/jira/browse/SPARK-44683
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.4.1
> Reporter: Siying Dong
> Assignee: Siying Dong
> Priority: Minor
> Fix For: 4.0.0, 3.5.1
>
> We pass log4j's log level to RocksDB so that RocksDB debug logs can go to log4j. However, the level is passed in after the logger has been created, and the way it is set isn't effective. This has two impacts: (1) setting DEBUG level doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly
[ https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-44683: Assignee: Siying Dong
> Logging level isn't passed to RocksDB state store provider correctly
>
> Key: SPARK-44683
> URL: https://issues.apache.org/jira/browse/SPARK-44683
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.4.1
> Reporter: Siying Dong
> Assignee: Siying Dong
> Priority: Minor
>
> We pass log4j's log level to RocksDB so that RocksDB debug logs can go to log4j. However, the level is passed in after the logger has been created, and the way it is set isn't effective. This has two impacts: (1) setting DEBUG level doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly
[ https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-44683. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42354 [https://github.com/apache/spark/pull/42354]
> Logging level isn't passed to RocksDB state store provider correctly
>
> Key: SPARK-44683
> URL: https://issues.apache.org/jira/browse/SPARK-44683
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 3.4.1
> Reporter: Siying Dong
> Assignee: Siying Dong
> Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
> We pass log4j's log level to RocksDB so that RocksDB debug logs can go to log4j. However, the level is passed in after the logger has been created, and the way it is set isn't effective. This has two impacts: (1) setting DEBUG level doesn't make RocksDB generate DEBUG level logs; (2) setting WARN or ERROR level does prevent INFO level logging, but RocksDB still makes JNI calls to Scala, which is an unnecessary overhead.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
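A minimal Python model (not Spark's actual Scala/JNI code) of why the ordering described in SPARK-44683 matters: if the native logger captures its level at construction time, setting the desired level afterwards has no effect on what it emits:

```python
class NativeLogger:
    """Stand-in for a native (RocksDB-style) logger whose level is fixed at construction."""
    def __init__(self, level):
        self.level = level
        self.records = []

    def log(self, level, msg):
        # Only the construction-time level is consulted.
        if level >= self.level:
            self.records.append(msg)

DEBUG, INFO = 10, 20

# Buggy ordering: logger created first, desired level recorded too late to matter.
buggy = NativeLogger(INFO)
buggy.requested_level = DEBUG        # never reaches the level used by log()
buggy.log(DEBUG, "compaction stats")
print(len(buggy.records))            # 0 -- the DEBUG record was dropped

# Fix: decide the level before creating the logger.
fixed = NativeLogger(DEBUG)
fixed.log(DEBUG, "compaction stats")
print(len(fixed.records))            # 1
```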
[jira] [Resolved] (SPARK-44694) Add Default & Active SparkSession for Python Client
[ https://issues.apache.org/jira/browse/SPARK-44694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44694. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42371 [https://github.com/apache/spark/pull/42371] > Add Default & Active SparkSession for Python Client > --- > > Key: SPARK-44694 > URL: https://issues.apache.org/jira/browse/SPARK-44694 > Project: Spark > Issue Type: Task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > SPARK-43429 for Python side -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44694) Add Default & Active SparkSession for Python Client
[ https://issues.apache.org/jira/browse/SPARK-44694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44694: Assignee: Hyukjin Kwon > Add Default & Active SparkSession for Python Client > --- > > Key: SPARK-44694 > URL: https://issues.apache.org/jira/browse/SPARK-44694 > Project: Spark > Issue Type: Task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-43429 for Python side -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44715) Add missing udf and callUdf functions
Herman van Hövell created SPARK-44715: - Summary: Add missing udf and callUdf functions Key: SPARK-44715 URL: https://issues.apache.org/jira/browse/SPARK-44715 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.5.0 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44705) Make PythonRunner single-threaded
[ https://issues.apache.org/jira/browse/SPARK-44705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751843#comment-17751843 ] Utkarsh Agarwal commented on SPARK-44705: - https://github.com/apache/spark/pull/42385 for this issue
> Make PythonRunner single-threaded
> -
>
> Key: SPARK-44705
> URL: https://issues.apache.org/jira/browse/SPARK-44705
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.5.0
> Reporter: Utkarsh Agarwal
> Priority: Major
>
> PythonRunner, a utility that executes Python UDFs in Spark, uses two threads in a producer-consumer model today. This multi-threading model is problematic and confusing, as Spark's execution model within a task is commonly understood to be single-threaded.
> More importantly, this departure into double-threaded execution resulted in a series of customer issues involving [race conditions|https://issues.apache.org/jira/browse/SPARK-33277] and [deadlocks|https://issues.apache.org/jira/browse/SPARK-38677] between threads, as the code was hard to reason about. There have been multiple attempts to rein in these issues, viz., [fix 1|https://issues.apache.org/jira/browse/SPARK-22535], [fix 2|https://github.com/apache/spark/pull/30177], [fix 3|https://github.com/apache/spark/commit/243c321db2f02f6b4d926114bd37a6e74c2be185]. Moreover, the fixes have made the code base somewhat abstruse by introducing multiple daemon [monitor threads|https://github.com/apache/spark/blob/a3a32912be04d3760cb34eb4b79d6d481bbec502/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L579] to detect deadlocks.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
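The two-thread producer-consumer model the ticket describes can be sketched as follows. This is illustrative only: in the real PythonRunner the writer thread streams serialized rows over a socket to the Python worker while the reader thread consumes results, and the coordination between the two is what the linked race-condition and deadlock issues are about:

```python
import queue
import threading

def producer(q):
    # Writer thread: feeds input rows toward the Python worker.
    for row in range(3):
        q.put(row)
    q.put(None)  # sentinel: no more input

def consumer(q, out):
    # Reader thread: consumes results until the sentinel arrives.
    while (item := q.get()) is not None:
        out.append(item)

q, out = queue.Queue(), []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, out))
t1.start(); t2.start()
t1.join(); t2.join()
print(out)  # [0, 1, 2]
```

Collapsing this to a single thread removes the cross-thread handoff (and the sentinel/monitor machinery) entirely, which is the simplification the ticket proposes.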
[jira] [Updated] (SPARK-44714) Ease restriction of LCA resolution regarding queries with having
[ https://issues.apache.org/jira/browse/SPARK-44714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinyi Yu updated SPARK-44714: - Description: Current LCA resolution has a limitation, that it can't resolve the query, when it satisfies all the following criteria: # the main (outer) query has having clause # there is a window expression in the query # in the same SELECT list as the window expression in 2), there is an lca This is because LCA won't rewrite plan until UNRESOLVED_HAVING is resolved; window expressions won't be extracted until LCA in the same SELECT lists are rewritten; however UNRESOLVED_HAVING depends on the child to be resolved, which could include the Window. It becomes a deadlock. *We should ease some limitation on the LCA resolution regarding to having, to break the deadlock for most cases.* For example, for the following query: {code:java} create table t (col boolean) using orc; with w AS ( select min(col) over () as min_alias, min_alias as col_alias FROM t ) select col_alias from w having count > 0; {code} It now throws confusing error message: {code:java} [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `col_alias` cannot be resolved. Did you mean one of the following? [`col_alias`, `min_alias`].{code} The LCA and window is in a CTE that is completely unrelated to the having. LCA should resolve in this case. was: Current LCA resolution has a limitation, that it can't resolve the query, when it satisfies all the following criteria: # the main (outer) query has having clause # there is a window expression in the query # in the same SELECT list as the window expression in 2), there is an lca This is because LCA won't rewrite plan until UNRESOLVED_HAVING is resolved; window expressions won't be extracted until LCA in the same SELECT lists are rewritten; however UNRESOLVED_HAVING depends on the child to be resolved, which could include the Window. It becomes a deadlock. 
*We should ease some limitations on the LCA resolution regarding having, to break the deadlock for most cases.* For example, for the following query: create table t (col boolean) using orc; with w AS (select min(col) over () as min_alias, min_alias as col_alias FROM t) select col_alias from w having count(*) > 0; It now throws a confusing error message: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `col_alias` cannot be resolved. Did you mean one of the following? [`col_alias`, `min_alias`]. The LCA and window are in a CTE that is completely unrelated to the having clause; LCA should resolve in this case. > Ease restriction of LCA resolution regarding queries with having > > > Key: SPARK-44714 > URL: https://issues.apache.org/jira/browse/SPARK-44714 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.2, 3.5.0 >Reporter: Xinyi Yu >Priority: Major > > Current LCA resolution has a limitation: it can't resolve a query when it satisfies all the following criteria: > # the main (outer) query has a having clause > # there is a window expression in the query > # in the same SELECT list as the window expression in 2), there is an LCA > This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; window expressions won't be extracted until LCAs in the same SELECT list are rewritten; however, UNRESOLVED_HAVING depends on its child being resolved, which could include the Window. It becomes a deadlock. 
> *We should ease some limitation on the LCA resolution regarding to having, to > break the deadlock for most cases.* > For example, for the following query: > {code:java} > create table t (col boolean) using orc; > with w AS ( > select min(col) over () as min_alias, > min_alias as col_alias > FROM t > ) > select col_alias > from w > having count > 0; > {code} > > It now throws confusing error message: > {code:java} > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name > `col_alias` > cannot be resolved. Did you mean one of the following? [`col_alias`, > `min_alias`].{code} > The LCA and window is in a CTE that is completely unrelated to the having. > LCA should resolve in this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
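The circular dependency described in the ticket above (HAVING waits on a resolved child, window extraction waits on the LCA rewrite, and the LCA rewrite waits on HAVING) can be modeled abstractly as rules that each wait on another. This is a toy fixed-point sketch, not Catalyst's actual resolver; the rule names are borrowed from the ticket for illustration only.

```python
def resolve(rules, max_iters=10):
    """Toy fixed-point resolver: a rule fires only once its prerequisite
    rule (if any) has fired. Returns the set of rules that resolved."""
    resolved = set()
    for _ in range(max_iters):
        progress = False
        for name, prereq in rules.items():
            if name not in resolved and (prereq is None or prereq in resolved):
                resolved.add(name)
                progress = True
        if not progress:  # fixed point reached (possibly a deadlock)
            break
    return resolved

# Deadlocked configuration mirroring the ticket: each rule waits on the next,
# forming a cycle, so nothing can ever fire.
deadlock = {"UNRESOLVED_HAVING": "ExtractWindow",
            "ExtractWindow": "RewriteLCA",
            "RewriteLCA": "UNRESOLVED_HAVING"}

# "Easing the restriction" amounts to dropping one edge of the cycle, e.g.
# letting the LCA rewrite proceed without waiting on HAVING.
eased = dict(deadlock, RewriteLCA=None)
```

With the cycle intact no rule resolves; removing one dependency lets the remaining rules resolve in order, which is the spirit of the proposed fix.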
[jira] [Created] (SPARK-44714) Ease restriction of LCA resolution regarding queries with having
Xinyi Yu created SPARK-44714: Summary: Ease restriction of LCA resolution regarding queries with having Key: SPARK-44714 URL: https://issues.apache.org/jira/browse/SPARK-44714 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.2, 3.5.0 Reporter: Xinyi Yu Current LCA resolution has a limitation: it can't resolve a query when it satisfies all the following criteria: # the main (outer) query has a having clause # there is a window expression in the query # in the same SELECT list as the window expression in 2), there is an LCA This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; window expressions won't be extracted until LCAs in the same SELECT list are rewritten; however, UNRESOLVED_HAVING depends on its child being resolved, which could include the Window. It becomes a deadlock. *We should ease some limitations on the LCA resolution regarding having, to break the deadlock for most cases.* For example, for the following query: create table t (col boolean) using orc; with w AS (select min(col) over () as min_alias, min_alias as col_alias FROM t) select col_alias from w having count(*) > 0; It now throws a confusing error message: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `col_alias` cannot be resolved. Did you mean one of the following? [`col_alias`, `min_alias`]. The LCA and window are in a CTE that is completely unrelated to the having clause; LCA should resolve in this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44713) Deduplicate files between sql/core and Spark Connect Scala Client
Herman van Hövell created SPARK-44713: - Summary: Deduplicate files between sql/core and Spark Connect Scala Client Key: SPARK-44713 URL: https://issues.apache.org/jira/browse/SPARK-44713 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.5.0 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44712) Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44712: --- Description: Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176] (was: Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]) > Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual > - > > Key: SPARK-44712 > URL: https://issues.apache.org/jira/browse/SPARK-44712 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > Migrate assert_eq to assertDataFrameEqual in this file: > [python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44712) Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual
Amanda Liu created SPARK-44712: -- Summary: Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual Key: SPARK-44712 URL: https://issues.apache.org/jira/browse/SPARK-44712 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44711: --- Description: Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63] was:Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46] > Migrate test_series_conversion assert_eq to use assertDataFrameEqual > > > Key: SPARK-44711 > URL: https://issues.apache.org/jira/browse/SPARK-44711 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > Migrate assert_eq to assertDataFrameEqual in this file: > [python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual
Amanda Liu created SPARK-44711: -- Summary: Migrate test_series_conversion assert_eq to use assertDataFrameEqual Key: SPARK-44711 URL: https://issues.apache.org/jira/browse/SPARK-44711 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44710) Support Dataset.dropDuplicatesWithinWatermark in Scala Client
Herman van Hövell created SPARK-44710: - Summary: Support Dataset.dropDuplicatesWithinWatermark in Scala Client Key: SPARK-44710 URL: https://issues.apache.org/jira/browse/SPARK-44710 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.5.0 Reporter: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44692) Move Trigger.scala to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44692. --- Fix Version/s: 3.5.0 Resolution: Fixed > Move Trigger.scala to sql/api > - > > Key: SPARK-44692 > URL: https://issues.apache.org/jira/browse/SPARK-44692 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44132) nesting full outer joins confuses code generator
[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44132. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 41712 [https://github.com/apache/spark/pull/41712] > nesting full outer joins confuses code generator > > > Key: SPARK-44132 > URL: https://issues.apache.org/jira/browse/SPARK-44132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 > Environment: We verified the existence of this bug from spark 3.3 > until spark 3.5. >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > We are seeing issues with the code generator when querying java bean encoded > data with 2 nested joins. > {code:java} > dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); > {code} > will generate invalid code in the code generator. And can depending on the > data used generate stack traces like: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > Or: > {code:java} > Caused by: java.lang.AssertionError: index (2) should < 2 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > When we look at the generated code, we see that the code generator seems to be mixing up parameters. For example: > {code:java} > if (smj_leftOutputRow_0 != null) { // < null check on the wrong (left) parameter > boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // < causes NPE on the right parameter here{code} > It is as if the nesting of 2 full outer joins is confusing the code generator and as such generating invalid code. > There is one other strange thing. We found this issue when using datasets which were using the java bean encoder. We tried to reproduce this in the spark shell or using scala case classes but were unable to do so. > We made a reproduction scenario as unit tests (one for each of the stack traces above) on the spark code base and made it available as a [pull request|https://github.com/apache/spark/pull/41688] on this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44132) nesting full outer joins confuses code generator
[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44132: Assignee: Steven Aerts > nesting full outer joins confuses code generator > > > Key: SPARK-44132 > URL: https://issues.apache.org/jira/browse/SPARK-44132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 > Environment: We verified the existence of this bug from spark 3.3 > until spark 3.5. >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > > We are seeing issues with the code generator when querying java bean encoded > data with 2 nested joins. > {code:java} > dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); > {code} > will generate invalid code in the code generator. And can depending on the > data used generate stack traces like: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > Or: > {code:java} > Caused by: java.lang.AssertionError: index (2) should < 2 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > When we look at the generated code, we see that the code generator seems to be mixing up parameters. For example: > {code:java} > if (smj_leftOutputRow_0 != null) { // < null check on the wrong (left) parameter > boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // < causes NPE on the right parameter here{code} > It is as if the nesting of 2 full outer joins is confusing the code generator and as such generating invalid code. > There is one other strange thing. We found this issue when using datasets which were using the java bean encoder. We tried to reproduce this in the spark shell or using scala case classes but were unable to do so. > We made a reproduction scenario as unit tests (one for each of the stack traces above) on the spark code base and made it available as a [pull request|https://github.com/apache/spark/pull/41688] on this case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44709) Fix flow control in ExecuteGrpcResponseSender
Juliusz Sompolski created SPARK-44709: - Summary: Fix flow control in ExecuteGrpcResponseSender Key: SPARK-44709 URL: https://issues.apache.org/jira/browse/SPARK-44709 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0, 4.0.0 Reporter: Juliusz Sompolski -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44708) Migrate test_reset_index assert_eq to use assertDataFrameEqual
Amanda Liu created SPARK-44708: -- Summary: Migrate test_reset_index assert_eq to use assertDataFrameEqual Key: SPARK-44708 URL: https://issues.apache.org/jira/browse/SPARK-44708 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44575) Implement Error Translation
[ https://issues.apache.org/jira/browse/SPARK-44575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44575: Assignee: Yihong He > Implement Error Translation > --- > > Key: SPARK-44575 > URL: https://issues.apache.org/jira/browse/SPARK-44575 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44575) Implement Error Translation
[ https://issues.apache.org/jira/browse/SPARK-44575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44575. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42266 [https://github.com/apache/spark/pull/42266] > Implement Error Translation > --- > > Key: SPARK-44575 > URL: https://issues.apache.org/jira/browse/SPARK-44575 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yihong He >Assignee: Yihong He >Priority: Major > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44508) Add user guide for Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44508: - Priority: Major (was: Blocker) > Add user guide for Python UDTFs > --- > > Key: SPARK-44508 > URL: https://issues.apache.org/jira/browse/SPARK-44508 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Add documentation for Python UDTFs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44508) Add user guide for Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751807#comment-17751807 ] Allison Wang commented on SPARK-44508: -- [~holden] Yup you're right. This isn't a blocker since we already have the documentation for the `udtf` function. I've updated the priority. > Add user guide for Python UDTFs > --- > > Key: SPARK-44508 > URL: https://issues.apache.org/jira/browse/SPARK-44508 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > > Add documentation for Python UDTFs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43606) Enable IndexesTests.test_index_basic for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43606: Assignee: Haejoon Lee > Enable IndexesTests.test_index_basic for pandas 2.0.0. > -- > > Key: SPARK-43606 > URL: https://issues.apache.org/jira/browse/SPARK-43606 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Enable IndexesTests.test_index_basic for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43606) Enable IndexesTests.test_index_basic for pandas 2.0.0.
[ https://issues.apache.org/jira/browse/SPARK-43606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43606. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42267 [https://github.com/apache/spark/pull/42267] > Enable IndexesTests.test_index_basic for pandas 2.0.0. > -- > > Key: SPARK-43606 > URL: https://issues.apache.org/jira/browse/SPARK-43606 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 4.0.0 > > > Enable IndexesTests.test_index_basic for pandas 2.0.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44707) Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped
[ https://issues.apache.org/jira/browse/SPARK-44707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44707. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42381 [https://github.com/apache/spark/pull/42381] > Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped > -- > > Key: SPARK-44707 > URL: https://issues.apache.org/jira/browse/SPARK-44707 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44707) Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped
[ https://issues.apache.org/jira/browse/SPARK-44707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44707: - Assignee: Dongjoon Hyun > Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped > -- > > Key: SPARK-44707 > URL: https://issues.apache.org/jira/browse/SPARK-44707 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44050) Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue
[ https://issues.apache.org/jira/browse/SPARK-44050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751796#comment-17751796 ] Holden Karau commented on SPARK-44050: -- Ah interesting, it sounds like the fix would be to retry config map creation on failure. > Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue > -- > > Key: SPARK-44050 > URL: https://issues.apache.org/jira/browse/SPARK-44050 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Submit >Affects Versions: 3.3.1 >Reporter: Harshwardhan Singh Dodiya >Priority: Critical > Attachments: image-2023-06-14-11-07-36-960.png > > > Dear Spark community, > I am facing an issue with mounting a ConfigMap in the driver pod of my Spark application. Upon investigation, I realized that the problem is caused by the ConfigMap not being created successfully. > *Problem Description:* > When attempting to mount the ConfigMap in the driver pod, I encounter consistent failures, and the pod stays in the ContainerCreating state. Upon further investigation, I discovered that the ConfigMap does not exist in the Kubernetes cluster, so the driver pod cannot access the required configuration data. > *Additional Information:* > I would like to highlight that this issue is not a frequent occurrence. It has been observed randomly, affecting the mounting of the ConfigMap in the driver pod only approximately 5% of the time. This intermittent behavior adds complexity to troubleshooting, as the issue is challenging to reproduce consistently. > *Error Message:* > Describing the driver pod (kubectl describe pod pod_name) shows the error below. > "ConfigMap '' not found." > *To Reproduce:* > 1. Download spark 3.3.1 from [https://spark.apache.org/downloads.html] > 2. Create an image with "bin/docker-image-tool.sh" > 3. Submit on the spark client via a bash command, passing all the details and configurations. > 4. 
Randomly, in some of the driver pods, we can observe this issue. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
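The fix suggested in the comment above (retrying ConfigMap creation on failure) could be sketched as follows. This is a hypothetical helper, not Spark's actual submission code: `create_fn` stands in for the Kubernetes API call, and the attempt count and backoff values are illustrative assumptions.

```python
import time

def create_with_retry(create_fn, max_attempts=3, base_delay=0.1):
    """Retry a flaky creation call (e.g. ConfigMap creation) with
    exponential backoff; re-raise the last error if every attempt fails."""
    last_err = None
    for attempt in range(max_attempts):
        try:
            return create_fn()
        except RuntimeError as err:
            last_err = err
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise last_err

# Demo stand-in for the Kubernetes API: fails twice, then succeeds,
# mimicking the intermittent ~5% failure described in the report.
calls = {"n": 0}

def flaky_create():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API failure")
    return "configmap-created"
```

A retry like this would mask transient API failures; the driver pod would then find the ConfigMap once creation eventually succeeds, instead of hanging in ContainerCreating.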
[jira] [Commented] (SPARK-44508) Add user guide for Python UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751795#comment-17751795 ] Holden Karau commented on SPARK-44508: -- I'm not sure this should be a blocker. > Add user guide for Python UDTFs > --- > > Key: SPARK-44508 > URL: https://issues.apache.org/jira/browse/SPARK-44508 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Blocker > > Add documentation for Python UDTFs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44597: --- Description: Migrate tests to new test utils in this file: python/pyspark/pandas/tests/test_sql.py (was: The Jira ticket [SPARK-44042] SPIP: PySpark Test Framework introduces a new PySpark test framework. Some of the user-facing testing util APIs include assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. With the new testing framework, we should migrate old tests in the Spark codebase to use the new testing utils.) > Migrate test_sql assert_eq to use assertDataFrameEqual > -- > > Key: SPARK-44597 > URL: https://issues.apache.org/jira/browse/SPARK-44597 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Migrate tests to new test utils in this file: > python/pyspark/pandas/tests/test_sql.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44589) Migrate PySpark tests to use PySpark built-in test utils
[ https://issues.apache.org/jira/browse/SPARK-44589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44589: --- Description: The Jira ticket SPARK-44042 SPIP: PySpark Test Framework introduces a new PySpark test framework. Some of the user-facing testing util APIs include assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. With the new testing framework, we should migrate old tests in the Spark codebase to use the new testing utils. was:Migrate existing tests in the PySpark codebase to use the new PySpark test utils, outlined here: https://issues.apache.org/jira/browse/SPARK-44042 > Migrate PySpark tests to use PySpark built-in test utils > > > Key: SPARK-44589 > URL: https://issues.apache.org/jira/browse/SPARK-44589 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > The Jira ticket SPARK-44042 SPIP: PySpark Test Framework introduces a new > PySpark test framework. Some of the user-facing testing util APIs include > assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. > With the new testing framework, we should migrate old tests in the Spark > codebase to use the new testing utils. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44561) Fix AssertionError when converting UDTF output to a complex type
[ https://issues.apache.org/jira/browse/SPARK-44561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-44561. --- Fix Version/s: 3.5.0 Assignee: Takuya Ueshin Resolution: Fixed Issue resolved by pull request 42310 https://github.com/apache/spark/pull/42310 > Fix AssertionError when converting UDTF output to a complex type > > > Key: SPARK-44561 > URL: https://issues.apache.org/jira/browse/SPARK-44561 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.5.0 > > > {code:java} > class TestUDTF: > def eval(self): > yield {'a': 1, 'b': 2}, > udtf(TestUDTF, returnType="x: map")().show() {code} > This will fail with: > File "pandas/_libs/lib.pyx", line 2834, in pandas._libs.lib.map_infer > File "python/pyspark/sql/pandas/types.py", line 804, in convert_map > assert isinstance(value, dict) > AssertionError > Same for `convert_struct` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
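For reference, the assertion being hit has this shape. The following is a minimal plain-Python sketch; the `convert_map` stand-in below is illustrative only, not PySpark's actual implementation:

```python
# Illustrative stand-in -- NOT PySpark's actual conversion code, just the
# shape of the AssertionError reported above.
def convert_map(value):
    # pyspark.sql.pandas.types.convert_map asserts the incoming value is a
    # dict before converting it; this stand-in does the same.
    assert isinstance(value, dict), "expected a dict for a map-typed column"
    return dict(value)

row = ({'a': 1, 'b': 2},)  # what `yield {'a': 1, 'b': 2},` produces: a 1-tuple

# Passing the row object itself (instead of the column value) trips the assert:
try:
    convert_map(row)
    failed = False
except AssertionError:
    failed = True
print(failed)                # True: the shape of the reported AssertionError

# Unpacking the row and converting the column value succeeds:
print(convert_map(row[0]))   # {'a': 1, 'b': 2}
```

The fix in pull request 42310 is about making this unpacking happen correctly for complex return types; the sketch only shows why the assert fires, not how the fix is implemented.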
[jira] [Created] (SPARK-44707) Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped
Dongjoon Hyun created SPARK-44707: - Summary: Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped Key: SPARK-44707 URL: https://issues.apache.org/jira/browse/SPARK-44707 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
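The intent of the change can be sketched generically. `on_close` and its arguments below are illustrative stand-ins (plain Python logging), not Spark's actual ExecutorPodsWatcher API:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ExecutorPodsWatcher")

def on_close(context_stopped: bool, reason: str) -> str:
    # When the SparkContext is already stopped, a watcher close is expected
    # shutdown noise, so log it at INFO instead of WARN (illustrative only).
    level = "INFO" if context_stopped else "WARN"
    logger.log(getattr(logging, level), "Kubernetes client closed: %s", reason)
    return level

print(on_close(True, "application shutdown"))   # INFO
print(on_close(False, "connection reset"))      # WARN
```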
[jira] [Commented] (SPARK-24282) Add support for PMML export for the Standard Scaler Stage
[ https://issues.apache.org/jira/browse/SPARK-24282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751758#comment-17751758 ] Holden Karau commented on SPARK-24282: -- I don't think we're going to do this anymore. > Add support for PMML export for the Standard Scaler Stage > - > > Key: SPARK-24282 > URL: https://issues.apache.org/jira/browse/SPARK-24282 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Trivial > > In anticipation of supporting exporting multiple stages together to PMML it > would be good to support one of the common feature preparation stages, > StandardScaler. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28740) Add support for building with bloop
[ https://issues.apache.org/jira/browse/SPARK-28740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-28740. -- Resolution: Won't Fix > Add support for building with bloop > --- > > Key: SPARK-28740 > URL: https://issues.apache.org/jira/browse/SPARK-28740 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Minor > > bloop can, in theory, build Scala faster. However, the JAR layout is a little > different when you try to run the tests. It would be useful if we updated > our test JAR discovery to work with bloop. > Before working on this, check that bloop itself has changed to > work with Spark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32111) Cleanup locks and docs in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-32111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751754#comment-17751754 ] Holden Karau commented on SPARK-32111: -- I think this could be a good target for Spark 4.0. > Cleanup locks and docs in CoarseGrainedSchedulerBackend > --- > > Key: SPARK-32111 > URL: https://issues.apache.org/jira/browse/SPARK-32111 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0, 4.0.0 >Reporter: Holden Karau >Priority: Trivial > > Our locking is inconsistent inside of CoarseGrainedSchedulerBackend. In > production this doesn't matter because the methods are called from inside > of ThreadSafeRPC, but it would be better to clean up the locks and docs to be > consistent, especially since other schedulers inherit from > CoarseGrainedSchedulerBackend. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32111) Cleanup locks and docs in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-32111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-32111: - Target Version/s: 4.0.0 Affects Version/s: 4.0.0 > Cleanup locks and docs in CoarseGrainedSchedulerBackend > --- > > Key: SPARK-32111 > URL: https://issues.apache.org/jira/browse/SPARK-32111 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0, 4.0.0 >Reporter: Holden Karau >Priority: Trivial > > Our locking is inconsistent inside of CoarseGrainedSchedulerBackend. In > production this doesn't matter because the methods are called from inside > of ThreadSafeRPC, but it would be better to clean up the locks and docs to be > consistent, especially since other schedulers inherit from > CoarseGrainedSchedulerBackend. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
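The hazard described can be illustrated generically. This is a minimal Python sketch under stated assumptions, not Spark's CoarseGrainedSchedulerBackend code: once every method that touches the shared bookkeeping takes the same lock, subclasses inherit a consistent locking discipline instead of relying on the caller's RPC thread for safety.

```python
import threading

class SchedulerBackendSketch:
    """Illustrative only: every access to the executor bookkeeping goes
    through the same lock, so subclasses calling these methods inherit a
    consistent locking discipline."""

    def __init__(self):
        self._lock = threading.Lock()
        self._executors = {}  # executor id -> core count

    def register(self, exec_id, cores):
        with self._lock:
            self._executors[exec_id] = cores

    def remove(self, exec_id):
        with self._lock:
            return self._executors.pop(exec_id, None)

    def total_cores(self):
        with self._lock:  # read under the same lock, not unsynchronized
            return sum(self._executors.values())

b = SchedulerBackendSketch()
b.register("1", 4)
b.register("2", 8)
print(b.total_cores())  # 12
b.remove("1")
print(b.total_cores())  # 8
```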
[jira] [Resolved] (SPARK-38475) Use error classes in org.apache.spark.serializer
[ https://issues.apache.org/jira/browse/SPARK-38475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38475. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42243 [https://github.com/apache/spark/pull/42243] > Use error classes in org.apache.spark.serializer > > > Key: SPARK-38475 > URL: https://issues.apache.org/jira/browse/SPARK-38475 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38475) Use error classes in org.apache.spark.serializer
[ https://issues.apache.org/jira/browse/SPARK-38475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38475: Assignee: Bo Zhang > Use error classes in org.apache.spark.serializer > > > Key: SPARK-38475 > URL: https://issues.apache.org/jira/browse/SPARK-38475 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)
[ https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751728#comment-17751728 ] Juan Carlos Blanco Martinez commented on SPARK-44706: - Using [Tuple1|https://www.scala-lang.org/api/2.13.4/scala/Tuple1.html] instead of DataRow serves as a workaround. > ProductEncoder not working as expected within User Defined Fuction (UDF) > > > Key: SPARK-44706 > URL: https://issues.apache.org/jira/browse/SPARK-44706 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: DBR (Databricks) 13.2, Spark 3.4.0 and Scala 2.12. >Reporter: Juan Carlos Blanco Martinez >Priority: Major > > Hi, > When running the following code in Databricks' notebook: > {code:java} > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} > > case class DataRow(field1: String) > val sparkSession = SparkSession.builder.getOrCreate() > import sparkSession.implicits._ > val udf1 = udf((x: String) => { > val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // > This is failing at runtime > 3 > }) > val df1 = Seq(DataRow("test1"), > DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is > working > display(df1) > {code} > I get a ScalaReflectionException at runtime: > {code:java} > ScalaReflectionException: class > $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with > com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad > of type class > com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with > classpath [] and parent being > com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897 > of type class > com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader > with classpath [] and parent being > com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a > of type class > 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with > classpath > [file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/] > and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class > sun.misc.Launcher$AppClassLoader with classpath [...] not found. > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) > at > $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11) > > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) > at > org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) > > at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302) > > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302) > > at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) > at > $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11) > {code} > Declaring the case class within the UDF: > {code:java} > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} > > case class DataRow(field1: String) > val sparkSession = 
SparkSession.builder.getOrCreate() > import sparkSession.implicits._ > val udf1 = udf((x: String) => { > case class DataRow(field1: String) > val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // > This is failing at runtime > 3 > }) > val df1 = Seq(DataRow("test1"), > DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is > working > display(df1) {code} > I get a different error: > {code:java} > error: value toDF is not a member of Seq[DataRow] val testData = > Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code} > Questions: > * Is this an expected behaviour or a bug? > Thanks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44068) Support positional parameters in Scala connect client
[ https://issues.apache.org/jira/browse/SPARK-44068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-44068. -- Resolution: Duplicate > Support positional parameters in Scala connect client > - > > Key: SPARK-44068 > URL: https://issues.apache.org/jira/browse/SPARK-44068 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Priority: Major > > Implement positional parameters of parametrized queries in the Scala connect > client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)
[ https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juan Carlos Blanco Martinez updated SPARK-44706: Description: Hi, When running the following code in Databricks' notebook: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a ScalaReflectionException at runtime: {code:java} ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad of type class com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897 of type class com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a of type class com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with classpath [file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/] and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) at org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11) {code} Declaring the case class within the UDF: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { case class DataRow(field1: String) val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a different error {code:java} error: value 
toDF is not a member of Seq[DataRow] val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code} Questions: * Is this an expected behaviour or a bug? Thanks. Thanks. was: Hi, When running the following code in Databricks' notebook: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a ScalaReflectionException at runtime: {code:java} ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad of type class com.databricks.backend.daemon.driv
[jira] [Updated] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)
[ https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juan Carlos Blanco Martinez updated SPARK-44706: Description: Hi, When running the following code in Databricks' notebook: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a ScalaReflectionException at runtime: {code:java} ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad of type class com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897 of type class com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a of type class com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with classpath [file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/] and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) at org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11) {code} Declaring the case class within the UDF: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { case class DataRow(field1: String) val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a different error {code:java} error: value 
toDF is not a member of Seq[DataRow] val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code} Thanks. was: Hi, When running the following code in Databricks' notebook: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a ScalaReflectionException at runtime: {code:java} ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad of type class com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with classpath [] and par
[jira] [Updated] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)
[ https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juan Carlos Blanco Martinez updated SPARK-44706: Description: Hi, When running the following code in Databricks' notebook: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a ScalaReflectionException at runtime: {code:java} ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad of type class com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897 of type class com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a of type class com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with classpath [file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/] and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) at org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11) {code} Declaring the case class within the UDF: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { case class DataRow(field1: String) val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a different error {code:java} error: value toDF is not a member of 
Seq[DataRow] val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code} Thanks. was: Hi, When running the following code in Databricks' notebook: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a ScalaReflectionException at runtime: {code:java} ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad of type class com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with classpath [] and parent being com.databricks.backend.daemon.
[jira] [Created] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)
Juan Carlos Blanco Martinez created SPARK-44706: --- Summary: ProductEncoder not working as expected within User Defined Function (UDF) Key: SPARK-44706 URL: https://issues.apache.org/jira/browse/SPARK-44706 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Environment: DBR (Databricks) 13.2, Spark 3.4.0 and Scala 2.12. Reporter: Juan Carlos Blanco Martinez Hi, When running the following code in Databricks' notebook: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a ScalaReflectionException at runtime: {code:java} ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad of type class com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897 of type class com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with classpath [] and parent being com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a of type class com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with classpath [file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/] and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class sun.misc.Launcher$AppClassLoader with classpath [...] not found.
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) at org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302) at org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) at $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11) {code} Declaring the case class within the UDF: {code:java} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} case class DataRow(field1: String) val sparkSession = SparkSession.builder.getOrCreate() import sparkSession.implicits._ val udf1 = udf((x: String) => { val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This is failing at runtime 3 }) val df1 = Seq(DataRow("test1"), DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is working display(df1) {code} I get a different exception at runtime: {code:java} at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1$adapted(command-2013905963200886:10) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:223) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1205) ... 111 more{code} Thanks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44705) Make PythonRunner single-threaded
Utkarsh Agarwal created SPARK-44705: --- Summary: Make PythonRunner single-threaded Key: SPARK-44705 URL: https://issues.apache.org/jira/browse/SPARK-44705 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.5.0 Reporter: Utkarsh Agarwal PythonRunner, a utility that executes Python UDFs in Spark, uses two threads in a producer-consumer model today. This multi-threading model is problematic and confusing, as Spark's execution model within a task is commonly understood to be single-threaded. More importantly, this departure into double-threaded execution resulted in a series of customer issues involving [race conditions|https://issues.apache.org/jira/browse/SPARK-33277] and [deadlocks|https://issues.apache.org/jira/browse/SPARK-38677] between threads, as the code was hard to reason about. There have been multiple attempts to rein in these issues, viz., [fix 1|https://issues.apache.org/jira/browse/SPARK-22535], [fix 2|https://github.com/apache/spark/pull/30177], [fix 3|https://github.com/apache/spark/commit/243c321db2f02f6b4d926114bd37a6e74c2be185]. Moreover, the fixes have made the code base somewhat abstruse by introducing multiple daemon [monitor threads|https://github.com/apache/spark/blob/a3a32912be04d3760cb34eb4b79d6d481bbec502/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L579] to detect deadlocks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
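The two threading models at issue can be sketched in plain Python. This is a toy illustration of the producer-consumer handoff versus a single-threaded loop, not Spark's actual PythonRunner code; the doubled values merely stand in for running a UDF.

```python
import queue
import threading

def two_thread_model(rows):
    # Today's model: a separate producer thread feeds data to the consumer
    # through a bounded buffer (like a pipe to the Python worker). Correct
    # here, but two threads can race or deadlock if either side blocks.
    buf = queue.Queue(maxsize=2)
    out = []

    def producer():
        for r in rows:
            buf.put(r)   # blocks when the buffer is full
        buf.put(None)    # sentinel marking end of input

    t = threading.Thread(target=producer, daemon=True)
    t.start()
    while True:
        item = buf.get()
        if item is None:
            break
        out.append(item * 2)  # stand-in for "run the Python UDF"
    t.join()
    return out

def single_thread_model(rows):
    # Proposed model: one thread alternates writing and reading, so the
    # control flow is linear and easy to reason about.
    return [r * 2 for r in rows]

print(two_thread_model([1, 2, 3]))     # [2, 4, 6]
print(single_thread_model([1, 2, 3]))  # [2, 4, 6]
```

Both models compute the same result; the single-threaded one simply has no cross-thread handoff to go wrong.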
[jira] [Resolved] (SPARK-44634) Encoders.bean no longer supports nested beans with type arguments
[ https://issues.apache.org/jira/browse/SPARK-44634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giambattista Bloisi resolved SPARK-44634. - Fix Version/s: 3.4.2 3.5.0 4.0.0 Resolution: Fixed PR has been merged > Encoders.bean no longer supports nested beans with type arguments > - > > Key: SPARK-44634 > URL: https://issues.apache.org/jira/browse/SPARK-44634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Giambattista Bloisi >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0 > > > Hi, > while upgrading a project from Spark 2.4.0 to 3.4.1, I have > encountered the same problem described in [java - Encoders.bean attempts to > check the validity of a return type considering its generic type and not its > concrete class, with Spark 3.4.0 - Stack > Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge]. > In short, starting from Spark 3.4.x Encoders.bean throws an exception > when the passed class contains a field whose type is a nested bean with type > arguments: > > {code:java} > class A<T> { >T value; >// value getter and setter > } > class B { >A<String> stringHolder; >// stringHolder getter and setter > } > Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: > [ENCODER_NOT_FOUND]..."{code} > > > It looks like this is a regression introduced with [SPARK-42093 SQL Move > JavaTypeInference to > AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b] > while getting rid of TypeToken, which somehow handled that case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
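The regression is about resolving a type argument (T -> String) through the concrete field type rather than the raw generic class. A rough Python analogy of that resolution step, purely illustrative (Spark's JavaTypeInference works on Java reflection, not Python typing):

```python
from typing import Generic, TypeVar, get_args, get_type_hints

T = TypeVar("T")

class A(Generic[T]):      # mirrors: class A<T> { T value; }
    value: T

class B:                  # mirrors: class B { A<String> stringHolder; }
    stringHolder: A[str]

# Inspecting the raw class A only yields the unresolved TypeVar T; an
# encoder must look at the concrete field type A[str] to learn that T = str.
field_type = get_type_hints(B)["stringHolder"]   # A[str]
(type_arg,) = get_args(field_type)
print(type_arg is str)   # True
```

The bug report says Spark 3.4.x checks the raw generic side (the TypeVar) instead of the concrete instantiation, which is why the encoder lookup fails.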
[jira] [Assigned] (SPARK-44697) Clean up the usage of org.apache.commons.lang3.RandomUtils
[ https://issues.apache.org/jira/browse/SPARK-44697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44697: Assignee: Yang Jie > Clean up the usage of org.apache.commons.lang3.RandomUtils > -- > > Key: SPARK-44697 > URL: https://issues.apache.org/jira/browse/SPARK-44697 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > {code:java} > /** > * Utility library that supplements the standard {@link Random} class. > * > * Caveat: Instances of {@link Random} are not cryptographically > secure. > * > * Please note that the Apache Commons project provides a component > * dedicated to pseudo-random number generation, namely > * <a href="https://commons.apache.org/proper/commons-rng/">Commons RNG</a>, > that may be > * a better choice for applications with more stringent requirements > * (performance and/or correctness). > * > * @deprecated Use Apache Commons RNG's optimized <a href="https://commons.apache.org/proper/commons-rng/commons-rng-client-api/apidocs/org/apache/commons/rng/UniformRandomProvider.html">UniformRandomProvider</a> > * @since 3.3 > */ > @Deprecated > public class RandomUtils { {code} > In {{commons-lang3}} 3.13.0, {{RandomUtils}} has been marked as > {{@Deprecated}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44697) Clean up the usage of org.apache.commons.lang3.RandomUtils
[ https://issues.apache.org/jira/browse/SPARK-44697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44697. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42370 [https://github.com/apache/spark/pull/42370] > Clean up the usage of org.apache.commons.lang3.RandomUtils > -- > > Key: SPARK-44697 > URL: https://issues.apache.org/jira/browse/SPARK-44697 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > > {code:java} > /** > * Utility library that supplements the standard {@link Random} class. > * > * Caveat: Instances of {@link Random} are not cryptographically > secure. > * > * Please note that the Apache Commons project provides a component > * dedicated to pseudo-random number generation, namely > * <a href="https://commons.apache.org/proper/commons-rng/">Commons RNG</a>, > that may be > * a better choice for applications with more stringent requirements > * (performance and/or correctness). > * > * @deprecated Use Apache Commons RNG's optimized <a href="https://commons.apache.org/proper/commons-rng/commons-rng-client-api/apidocs/org/apache/commons/rng/UniformRandomProvider.html">UniformRandomProvider</a> > * @since 3.3 > */ > @Deprecated > public class RandomUtils { {code} > In {{commons-lang3}} 3.13.0, {{RandomUtils}} has been marked as > {{@Deprecated}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
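The caveat in the quoted javadoc (instances of {@link Random} are not cryptographically secure) applies in most languages, not just Java. As a Spark-unrelated Python illustration: a seeded pseudo-random generator is fully reproducible and therefore predictable, so secrets should instead come from a CSPRNG source such as the `secrets` module.

```python
import random
import secrets

# A seeded Mersenne Twister is fully reproducible: fine for tests and
# sampling, but never for tokens or keys, because it is predictable.
a = random.Random(42)
b = random.Random(42)
print(a.randint(0, 10**9) == b.randint(0, 10**9))   # True: same seed, same stream

# For anything security-sensitive, draw from the OS CSPRNG instead.
token = secrets.token_hex(16)
print(len(token))   # 32 hex characters
```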
[jira] [Commented] (SPARK-44686) Add option to create RowEncoder in Encoders helper class.
[ https://issues.apache.org/jira/browse/SPARK-44686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751671#comment-17751671 ] Herman van Hövell commented on SPARK-44686: --- This has been merged to master/3.5. Preparing backports for 3.4/3.3. > Add option to create RowEncoder in Encoders helper class. > - > > Key: SPARK-44686 > URL: https://issues.apache.org/jira/browse/SPARK-44686 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44704) Cleanup shuffle files from host node after migration due to graceful decommissioning
[ https://issues.apache.org/jira/browse/SPARK-44704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751654#comment-17751654 ] Deependra Patel commented on SPARK-44704: - I will create for this soon. > Cleanup shuffle files from host node after migration due to graceful > decommissioning > > > Key: SPARK-44704 > URL: https://issues.apache.org/jira/browse/SPARK-44704 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.4.1 >Reporter: Deependra Patel >Priority: Minor > > Although these files will be deleted at the end of the application by the > external shuffle service, doing this early can free up resources and can help > in long running applications running out of disk space. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44704) Cleanup shuffle files from host node after migration due to graceful decommissioning
Deependra Patel created SPARK-44704: --- Summary: Cleanup shuffle files from host node after migration due to graceful decommissioning Key: SPARK-44704 URL: https://issues.apache.org/jira/browse/SPARK-44704 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 3.4.1 Reporter: Deependra Patel Although these files will be deleted at the end of the application by the external shuffle service, doing this early can free up resources and can help in long running applications running out of disk space. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44694) Add Default & Active SparkSession for Python Client
[ https://issues.apache.org/jira/browse/SPARK-44694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751634#comment-17751634 ] Nikita Awasthi commented on SPARK-44694: User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/42371 > Add Default & Active SparkSession for Python Client > --- > > Key: SPARK-44694 > URL: https://issues.apache.org/jira/browse/SPARK-44694 > Project: Spark > Issue Type: Task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-43429 for Python side -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44701) Skip ClassificationTestsOnConnect when torch is not installed
[ https://issues.apache.org/jira/browse/SPARK-44701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44701: Assignee: Ruifeng Zheng > Skip ClassificationTestsOnConnect when torch is not installed > - > > Key: SPARK-44701 > URL: https://issues.apache.org/jira/browse/SPARK-44701 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44701) Skip ClassificationTestsOnConnect when torch is not installed
[ https://issues.apache.org/jira/browse/SPARK-44701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44701. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42375 [https://github.com/apache/spark/pull/42375] > Skip ClassificationTestsOnConnect when torch is not installed > - > > Key: SPARK-44701 > URL: https://issues.apache.org/jira/browse/SPARK-44701 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20080) Spark streaming application does not throw serialization exception in foreachRDD
[ https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev updated SPARK-20080: --- Description: Steps to reproduce: 1) Use a Spark Streaming application. 2) Inside foreachRDD of any DStream, use rdd.foreachPartition. 3) Use org.slf4j.Logger inside the foreachPartition action: initialize it there, or use it as a field captured by the closure. Result: No exception is thrown, and the foreachPartition action is not executed. Expected result: java.io.NotSerializableException is thrown. Description: When I try to use or initialize org.slf4j.Logger inside foreachPartition that was extracted to a trait method called from foreachRDD, I found that the foreachPartition method does not execute and no exception appears. Tested with Spark in local and YARN modes. The code can be found on [github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17]. There are two main classes that demonstrate the problem. If I run the same code as a batch job,
I will get exception -> {code:java} Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923) at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12) at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7) at ReproduceBugMain$.main(ReproduceBugMain.scala:14) at ReproduceBugMain.main(ReproduceBugMain.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147) Caused by: java.io.NotSerializableException: ReproduceBugMain$ Serialization stack: - object not serializable (class: ReproduceBugMain$, value: ReproduceBugMain$@3935e9a8) - field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: $outer, type: interface TraitWithMethod) - object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, ) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295) ... 18 more {code} Two commits can be found on GitHub: the initial one linked above (it contains the streaming example), and the last one with the batch job example.
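The `Serialization stack` in the report above points at the actual culprit: the closure's `$outer` field drags the whole non-serializable `ReproduceBugMain$` object along. The same failure mode can be imitated in Python with pickle; this is only an analogy for Spark's closure serialization, and the `Driver` class here is hypothetical.

```python
import functools
import operator
import pickle
import threading

class Driver:
    """Stands in for the outer REPL/main object ($outer) that a closure drags in."""

    def __init__(self):
        self.factor = 2
        self.lock = threading.Lock()  # not picklable, like a live classloader

    def scale(self, x):
        return x * self.factor

    def bad_task(self):
        # A bound method captures the whole Driver instance, lock included,
        # so serializing it fails -- the analogue of NotSerializableException.
        return self.scale

    def good_task(self):
        # Copy the plain value out first so only `factor` travels with the task.
        return functools.partial(operator.mul, self.factor)

d = Driver()
try:
    pickle.dumps(d.bad_task())
except TypeError as err:
    print("not serializable:", err)

restored = pickle.loads(pickle.dumps(d.good_task()))
print(restored(10))  # 20
```

The usual Spark fix is the same shape: copy the needed primitive into a local `val` before the closure, so `$outer` is not captured.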
[jira] [Resolved] (SPARK-31427) Spark Structured Streaming reads data twice per micro-batch.
[ https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev resolved SPARK-31427. Resolution: Invalid > Spark Structured Streaming reads data twice per micro-batch. > > > Key: SPARK-31427 > URL: https://issues.apache.org/jira/browse/SPARK-31427 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3, 2.4.7, 3.0.1, 3.0.2 >Reporter: Nick Hryhoriev >Priority: Major > > I have a very strange issue with Spark Structured Streaming: it creates two > Spark jobs for every micro-batch and, as a result, reads data from Kafka twice. > Here is a simple code snippet. > > {code:java} > import org.apache.hadoop.fs.{FileSystem, Path} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.streaming.Trigger > object CheckHowSparkReadFromKafka { > def main(args: Array[String]): Unit = { > val session = SparkSession.builder() > .config(new SparkConf() > .setAppName(s"simple read from kafka with repartition") > .setMaster("local[*]") > .set("spark.driver.host", "localhost")) > .getOrCreate() > val testPath = "/tmp/spark-test" > FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new > Path(testPath), true) > import session.implicits._ > val stream = session > .readStream > .format("kafka") > .option("kafka.bootstrap.servers","kafka-20002-prod:9092") > .option("subscribe", "topic") > .option("maxOffsetsPerTrigger", 1000) > .option("failOnDataLoss", false) > .option("startingOffsets", "latest") > .load() > .repartitionByRange( $"offset") > .writeStream > .option("path", testPath + "/data") > .option("checkpointLocation", testPath + "/checkpoint") > .format("parquet") > .trigger(Trigger.ProcessingTime(10.seconds)) > .start() > stream.processAllAvailable() > {code} > This happens because of {{.repartitionByRange($"offset")}}; if I remove this > line, all is good.
But with it, Spark creates two jobs: one with a single stage that just reads > from Kafka, and a second with three stages (read -> shuffle -> write). So the result > of the first job is never used. > This has a significant impact on performance. Some of my Kafka topics have > 1550 partitions, so reading them twice is a big deal. If I add a cache, > things go better, but that is not an option for me. In local mode, the first > job in a batch takes less than 0.1 ms, except the batch with index 0. But on YARN > and Mesos clusters both jobs fully execute and on my topics take nearly 1.2 > min. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
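For what it's worth, {{repartitionByRange}} needs to sample the input to compute its range boundaries, so the extra job is most likely the boundary-sampling pass rather than wasted output; that would also explain why caching helps. The trade-off can be sketched with a toy lazy source in Python (illustrative only, not Spark's execution model):

```python
class LazySource:
    """Toy stand-in for an uncached Kafka scan: every job re-reads the input."""

    def __init__(self):
        self.reads = 0

    def scan(self):
        self.reads += 1
        return [5, 3, 1, 4, 2]

# Uncached: the sampling pass and the real pass each trigger their own scan.
src = LazySource()
sample = sorted(src.scan())[::2]       # stand-in for the range-boundary sampling job
result = [x * 10 for x in src.scan()]  # stand-in for the read -> shuffle -> write job
print(src.reads)                       # 2

# Cached: materialize once, then both passes reuse the same data.
cached = LazySource()
data = cached.scan()
sample = sorted(data)[::2]
result = [x * 10 for x in data]
print(cached.reads)                    # 1
```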
[jira] [Resolved] (SPARK-34204) When using input_file_name(), all columns from the file appear in the physical plan of the query, not only the projection.
[ https://issues.apache.org/jira/browse/SPARK-34204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev resolved SPARK-34204. Fix Version/s: 3.1.1 Resolution: Fixed > When using input_file_name(), all columns from the file appear in the physical > plan of the query, not only the projection. > - > > Key: SPARK-34204 > URL: https://issues.apache.org/jira/browse/SPARK-34204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Nick Hryhoriev >Priority: Major > Fix For: 3.1.1 > > > The input_file_name() function breaks projection pruning in the physical plan of > the query. > When this function is used as a new column, column-oriented formats like Parquet > and ORC end up with all columns in the physical plan. > Without it, only the selected columns are read. > In my case, the performance impact is 30x. > {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > object TestSize { > def main(args: Array[String]): Unit = { > implicit val spark: SparkSession = SparkSession.builder() > .master("local") > .config("spark.sql.shuffle.partitions", "5") > .getOrCreate() > import spark.implicits._ > val query1 = spark.read.parquet( > "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet" > ) > .select($"app_id", $"idfa", input_file_name().as("fileName")) > .distinct() > .count() >val query2 = spark.read.parquet( > "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet" ) > .select($"app_id", $"idfa") > .distinct() > .count() > Thread.sleep(100L) > } > } > {code} > `query1` has all columns in the physical plan, while `query2` has only two. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35094) Spark from_json(JsonToStruct) function returns a wrong value in permissive mode during best-effort parsing
[ https://issues.apache.org/jira/browse/SPARK-35094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751630#comment-17751630 ] Nick Hryhoriev commented on SPARK-35094: Maybe I misunderstand: has the issue been resolved? > Spark from_json(JsonToStruct) function returns a wrong value in permissive mode > during best-effort parsing > - > > Key: SPARK-35094 > URL: https://issues.apache.org/jira/browse/SPARK-35094 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Nick Hryhoriev >Priority: Major > > I use Spark 3.1.1 and 3.0.2. > The `from_json` function returns a wrong result in PERMISSIVE mode. > In a corner case: > 1. The JSON message has a complex nested structure: > \{sameNameField)damaged, nestedVal:{badSchemaNestedVal, > sameNameField_WhichValueWillAppearInwrongPlace}} > 2. Nested -> nested field: the schema does not fully align with the value in the JSON. > Scala code to reproduce: > {code:java} > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions.from_json > import org.apache.spark.sql.types.IntegerType > import org.apache.spark.sql.types.StringType > import org.apache.spark.sql.types.StructField > import org.apache.spark.sql.types.StructType > object Main { > def main(args: Array[String]): Unit = { > implicit val spark: SparkSession = > SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > val schemaForFieldWhichWillHaveWrongValue = > StructField("problematicName", StringType, nullable = true) > val nestedFieldWhichNotSatisfyJsonMessage = StructField( > "badNestedField", > StructType(Seq(StructField("SomethingWhichNotInJsonMessage", > IntegerType, nullable = true))) > ) > val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage = > StructField( > "nestedField", > StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, > schemaForFieldWhichWillHaveWrongValue)) > ) > val customSchema = StructType(Seq( > schemaForFieldWhichWillHaveWrongValue, >
nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage > )) > val jsonStringToTest = > > """{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}""" > val df = List(jsonStringToTest) > .toDF("json") > // the issue happens only in permissive mode during best effort > .select(from_json($"json", customSchema).as("toBeFlatten")) > .select("toBeFlatten.*") > df.show(truncate = false) > assert( > df.select("problematicName").as[String].first() == > "ThisValueWillBeOverwritten", > "wrong value in root schema, parser takes value from a column with the same > name but at another nested level" > ) > } > } > {code} > I was not able to debug this issue to find the exact root cause. > But what I found while debugging is that in > `org.apache.spark.sql.catalyst.util.FailureSafeParser`, at line 64, the expression > `e.partialResult()` already has a wrong value. > I hope this will help to fix the issue. > I did a DIRTY HACK to work around the issue. > I just forked this function and hardcoded `None` -> `Iterator(toResultRow(None, > e.record))`. > In my case, it's better not to have any values in the row than to > theoretically have a wrong value in some column. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
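The reporter's "dirty hack" amounts to: when best-effort parsing is unsure, emit nulls instead of a possibly wrong partial row. A toy Python sketch of that permissive idea follows; it is not Spark's FailureSafeParser, and the schema handling here is deliberately simplistic (flat fields, no nesting).

```python
import json

def permissive_parse(raw, schema):
    """Best-effort row parse: keep fields whose runtime type matches the
    schema, replace everything else with None (never guess a value)."""
    try:
        obj = json.loads(raw)
    except ValueError:
        # fully corrupt record -> an all-null row
        return {name: None for name in schema}
    return {
        name: (obj.get(name) if isinstance(obj.get(name), expected) else None)
        for name, expected in schema.items()
    }

schema = {"problematicName": str, "badNestedField": int}
row = permissive_parse('{"problematicName": "ok", "badNestedField": "14"}', schema)
print(row)  # {'problematicName': 'ok', 'badNestedField': None}
```

A null for the mismatched field is recoverable downstream; a silently misplaced value, as in the bug report, is not.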
[jira] [Created] (SPARK-44703) Log eventLog rewrite duration when compacting old event log files
shuyouZZ created SPARK-44703: Summary: Log eventLog rewrite duration when compacting old event log files Key: SPARK-44703 URL: https://issues.apache.org/jira/browse/SPARK-44703 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.4.1 Reporter: shuyouZZ Fix For: 3.5.0 When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog files exceeds the value of {{spark.history.fs.eventLog.rolling.maxFilesToRetain}}, the HistoryServer will compact the old event log files into one compact file. Currently the {{rewrite}} method does not log the rewrite duration; this metric is useful for understanding the compaction duration, so we need to add logging to the method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
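The proposed change boils down to wrapping the rewrite with a timer and logging the elapsed time. A minimal Python sketch of the pattern; the function name, log format, and file model are hypothetical, not the actual HistoryServer code.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("eventlog.compact")

def rewrite(event_files):
    """Toy compaction: merge several event logs into one, logging how long
    the rewrite took (the metric SPARK-44703 asks for)."""
    start = time.monotonic()
    compacted = [line for f in event_files for line in f]
    elapsed_ms = (time.monotonic() - start) * 1000.0
    logger.info("Finished rewriting %d event log files in %.1f ms",
                len(event_files), elapsed_ms)
    return compacted

print(rewrite([["a", "b"], ["c"]]))  # ['a', 'b', 'c']
```

Using a monotonic clock for durations avoids skew from wall-clock adjustments while the compaction runs.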
[jira] [Updated] (SPARK-40497) Upgrade Scala to 2.13.11
[ https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40497: - Fix Version/s: (was: 3.5.0) > Upgrade Scala to 2.13.11 > > > Key: SPARK-40497 > URL: https://issues.apache.org/jira/browse/SPARK-40497 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > We tested and decided to skip the following releases. This issue aims to use > 2.13.11. > - 2022-09-21: v2.13.9 released > [https://github.com/scala/scala/releases/tag/v2.13.9] > - 2022-10-13: 2.13.10 released > [https://github.com/scala/scala/releases/tag/v2.13.10] > > Scala 2.13.11 Milestone > - https://github.com/scala/scala/milestone/100 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40497) Upgrade Scala to 2.13.11
[ https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40497: - Affects Version/s: 4.0.0 (was: 3.4.0) > Upgrade Scala to 2.13.11 > > > Key: SPARK-40497 > URL: https://issues.apache.org/jira/browse/SPARK-40497 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > We tested and decided to skip the following releases. This issue aims to use > 2.13.11. > - 2022-09-21: v2.13.9 released > [https://github.com/scala/scala/releases/tag/v2.13.9] > - 2022-10-13: 2.13.10 released > [https://github.com/scala/scala/releases/tag/v2.13.10] > > Scala 2.13.11 Milestone > - https://github.com/scala/scala/milestone/100 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40497) Upgrade Scala to 2.13.11
[ https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reopened SPARK-40497: -- Reopening this because SPARK-44690 & SPARK-44376 downgraded Scala to 2.13.8. We need to upgrade again once we find a way to solve the problems described in SPARK-44690 & SPARK-44376. > Upgrade Scala to 2.13.11 > > > Key: SPARK-40497 > URL: https://issues.apache.org/jira/browse/SPARK-40497 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > We tested and decided to skip the following releases. This issue aims to use > 2.13.11. > - 2022-09-21: v2.13.9 released > [https://github.com/scala/scala/releases/tag/v2.13.9] > - 2022-10-13: 2.13.10 released > [https://github.com/scala/scala/releases/tag/v2.13.10] > > Scala 2.13.11 Milestone > - https://github.com/scala/scala/milestone/100 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait
[ https://issues.apache.org/jira/browse/SPARK-44702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751559#comment-17751559 ] Ben Hemsi commented on SPARK-44702: --- This is my first time working on Spark. I'm happy to be assigned to this issue, but I don't know how to assign myself to it. I've added the affects versions I've reproduced the issue on, but I think the bug affects all versions. > Cannot derive ExpressionEncoder when using types tagged by a trait > -- > > Key: SPARK-44702 > URL: https://issues.apache.org/jira/browse/SPARK-44702 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.2 >Reporter: Ben Hemsi >Priority: Minor > Labels: spark-sql > > Steps to reproduce: > {code:java} > trait Tag > case class Foo(x: Int) > case class Bar(x: Foo with Tag) > ExpressionEncoder.apply[Bar](){code} > The ExpressionEncoder throws the following error (this was for 3.1.2): > {code:java} > scala.MatchError: Foo with Tag (of class > scala.reflect.internal.Types$RefinedType0) > at > org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931) > at > org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49) > {code} > The bug is [on this line (on > master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461]. > {code:java} > val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code} > which is an incomplete pattern match. The pattern match needs to be extended to > include matching on RefinedType as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait
[ https://issues.apache.org/jira/browse/SPARK-44702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Hemsi updated SPARK-44702: -- Priority: Minor (was: Major) > Cannot derive ExpressionEncoder when using types tagged by a trait > -- > > Key: SPARK-44702 > URL: https://issues.apache.org/jira/browse/SPARK-44702 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.2 >Reporter: Ben Hemsi >Priority: Minor > Labels: spark-sql > > Steps to reproduce: > {code:java} > trait Tag > case class Foo(x: Int) > case class Bar(x: Foo with Tag) > ExpressionEncoder.apply[Bar](){code} > the ExpressionEncoder throws the following error (this was for 3.1.2): > {code:java} > scala.MatchError: Foo with Tag (of class > scala.reflect.internal.Types$RefinedType0) > at > org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931) > at > org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49) > {code} > The bug is [on this line (on > master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461]. > {code:java} > val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code} > which is an incomplete pattern match. The pattern match needs be extended to > include matching on RefinedType as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751555#comment-17751555 ] ASF GitHub Bot commented on SPARK-44700: User 'monkeyboy123' has created a pull request for this issue: https://github.com/apache/spark/pull/42376 > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like below: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > if the rule OptimizeCsvJsonExprs is applied, the expression regexp_replace > will be invoked many times, which costs so much time. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
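The query shape from the report can be sketched in the DataFrame API. This is an illustrative reproduction only: the two-field schema stands in for the 100+-field ${device_schema}, and the duplication claim is the one made in the report (after the rule rewrites the from_json into per-field accesses, each field can carry its own copy of the regexp_replace child):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, regexp_replace}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative repro of the reported shape; schema and column names
// are assumptions, not the reporter's actual ${device_schema}.
object OptimizeCsvJsonExprsShape {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    val deviceSchema = StructType(Seq(
      StructField("user_device_a", StringType),
      StructField("user_device_b", StringType)))

    // The expensive child expression shared by every extracted field.
    val cleaned = regexp_replace($"device_personas",
      "(?<=(\\{|,))\"device_", "\"user_device_")
    val parsed = from_json(cleaned, deviceSchema)

    val df = Seq("""{"device_a":"1","device_b":"2"}""").toDF("device_personas")

    // Per the report, once OptimizeCsvJsonExprs rewrites this into
    // per-field accesses, regexp_replace may be evaluated once per
    // selected field instead of once per row; explain() shows the plan.
    df.select(parsed.getField("user_device_a"),
              parsed.getField("user_device_b"))
      .explain(true)

    spark.stop()
  }
}
```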
[jira] [Updated] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait
[ https://issues.apache.org/jira/browse/SPARK-44702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Hemsi updated SPARK-44702: -- Description: Steps to reproduce: {code:java} trait Tag case class Foo(x: Int) case class Bar(x: Foo with Tag) ExpressionEncoder.apply[Bar](){code} the ExpressionEncoder throws the following error (this was for 3.1.2): {code:java} scala.MatchError: Foo with Tag (of class scala.reflect.internal.Types$RefinedType0) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928) at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49) {code} The bug is [on this line (on master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461]. {code:java} val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code} which is an incomplete pattern match. The pattern match needs be extended to include matching on RefinedType as well. was: Steps to reproduce: {code:java} trait Tag case class Foo(x: Int) case class Bar(x: Foo with Tag) ExpressionEncoder.apply[Bar](){code} the ExpressionEncoder throws the following error (this was for 3.1.2): {code:java} scala.MatchError: Foo with Tag (of class scala.reflect.internal.Types$RefinedType0) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928) at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49) {code} The bug is [on this line (on master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461]. 
{code:java} val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code} which is an incomplete pattern match. The pattern match needs be extended to include matching on RefinedType as well. > Cannot derive ExpressionEncoder when using types tagged by a trait > -- > > Key: SPARK-44702 > URL: https://issues.apache.org/jira/browse/SPARK-44702 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.2 >Reporter: Ben Hemsi >Priority: Major > Labels: spark-sql > > Steps to reproduce: > {code:java} > trait Tag > case class Foo(x: Int) > case class Bar(x: Foo with Tag) > ExpressionEncoder.apply[Bar](){code} > the ExpressionEncoder throws the following error (this was for 3.1.2): > {code:java} > scala.MatchError: Foo with Tag (of class > scala.reflect.internal.Types$RefinedType0) > at > org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931) > at > org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49) > {code} > The bug is [on this line (on > master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461]. > {code:java} > val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code} > which is an incomplete pattern match. The pattern match needs be extended to > include matching on RefinedType as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait
Ben Hemsi created SPARK-44702: - Summary: Cannot derive ExpressionEncoder when using types tagged by a trait Key: SPARK-44702 URL: https://issues.apache.org/jira/browse/SPARK-44702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2, 3.2.4, 3.1.3 Reporter: Ben Hemsi Steps to reproduce: {code:java} trait Tag case class Foo(x: Int) case class Bar(x: Foo with Tag) ExpressionEncoder.apply[Bar](){code} The ExpressionEncoder throws the following error (this was for 3.1.2): {code:java} scala.MatchError: Foo with Tag (of class scala.reflect.internal.Types$RefinedType0) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931) at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928) at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49) {code} The bug is [on this line (on master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461]. {code:java} val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code} which is an incomplete pattern match. The pattern match needs to be extended to include matching on RefinedType as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
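The failure mode above, and one possible direction for a fix, can be shown with a self-contained reflection sketch. The `typeArgsOf` helper and its treatment of refinements are hypothetical, not Spark's actual ScalaReflection code:

```scala
import scala.reflect.runtime.universe._

// Sketch of the MatchError and a total match over the type shape.
// `typeArgsOf` is a hypothetical helper, not Spark's implementation.
object RefinedTypeSketch {
  trait Tag
  case class Foo(x: Int)

  // An irrefutable binding like
  //   val TypeRef(_, _, actualTypeArgs) = dealiasedTpe
  // throws scala.MatchError when dealiasedTpe is a RefinedType
  // such as `Foo with Tag`. Matching the refinement explicitly avoids that:
  def typeArgsOf(tpe: Type): List[Type] = tpe.dealias match {
    case TypeRef(_, _, args)     => args
    case RefinedType(parents, _) => typeArgsOf(parents.head) // illustrative choice
    case _                       => Nil
  }

  def main(args: Array[String]): Unit = {
    println(typeArgsOf(typeOf[Foo with Tag])) // handled, no MatchError
    println(typeArgsOf(typeOf[List[Int]]))    // the ordinary TypeRef case
  }
}
```

Whether falling back to the first parent is the right semantics for Spark is a design question for the fix; the point of the sketch is only that the match must cover RefinedType.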
[jira] [Commented] (SPARK-44669) Parquet/ORC files written using Hive Serde should have a file extension
[ https://issues.apache.org/jira/browse/SPARK-44669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751548#comment-17751548 ] ASF GitHub Bot commented on SPARK-44669: User 'pan3793' has created a pull request for this issue: https://github.com/apache/spark/pull/42336 > Parquet/ORC files written using Hive Serde should have a file extension > > > Key: SPARK-44669 > URL: https://issues.apache.org/jira/browse/SPARK-44669 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-44700: --- Description: _SQL_ like below: select tmp.* from (select device_id, ads_id, from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', '"user_device_'), ${device_schema}) as tmp from input ) ${device_schema} includes more than 100 fields. if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, will be invoked many times, that costs so much time. was: _SQL_ like below: select tmp.* from (select device_id, ads_id, from_json(regexp_replace(device_personas, '(?<=(\{|,))"device_', '"user_device_'), ${device_schema}) as tmp from input ) ${device_schema} has more than 100 fields. if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, will be invoked many times, that costs so much time. > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like below: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, > will be invoked many times, that costs so much time. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-44700: --- Description: _SQL_ like below: select tmp.* from (select device_id, ads_id, from_json(regexp_replace(device_personas, '(?<=(\{|,))"device_', '"user_device_'), ${device_schema}) as tmp from input ) ${device_schema} has more than 100 fields. if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, will be invoked many times, that costs so much time. was: _SQL_ like below: ``` select tmp.* from (select device_id, ads_id, from_json(regexp_replace(device_personas, '(?<=({|,))"device_', '"user_device_'), ${device_schema}) as tmp from input ) ${device_schema} has more than 100 fields ``` if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, will be invoked many times, that costs so much time. > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like below: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} has more than 100 fields. > if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, > will be invoked many times, that costs so much time. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-44700: --- Description: _SQL_ like below: ``` select tmp.* from (select device_id, ads_id, from_json(regexp_replace(device_personas, '(?<=({|,))"device_', '"user_device_'), ${device_schema}) as tmp from input ) ${device_schema} has more than 100 fields ``` if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, will be invoked many times, that costs so much time. was: _SQL_ like below: ``` select tmp.* from (select device_id, ads_id, from_json(regexp_replace(device_personas, '(?<=(\\\{|,))"device_', '"user_device_'), ${device_schema}) as tmp from input ) ${device_schema} has more than 100 fields ``` if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, will be invoked many times, that costs so much time > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like below: > ``` > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=({|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} has more than 100 fields > ``` > if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, > will be invoked many times, that costs so much time. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-44700: --- Description: _SQL_ like below: ``` select tmp.* from (select device_id, ads_id, from_json(regexp_replace(device_personas, '(?<=(\\\{|,))"device_', '"user_device_'), ${device_schema}) as tmp from input ) ${device_schema} has more than 100 fields ``` if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, will be invoked many times, that costs so much time > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like below: > ``` > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} has more than 100 fields > ``` > if Rule: OptimizeCsvJsonExprs been applied, the expression, regexp_replace, > will be invoked many times, that costs so much time > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44701) Skip ClassificationTestsOnConnect when torch is not installed
Ruifeng Zheng created SPARK-44701: - Summary: Skip ClassificationTestsOnConnect when torch is not installed Key: SPARK-44701 URL: https://issues.apache.org/jira/browse/SPARK-44701 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-44700: --- Component/s: SQL > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: jiahong.li >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-44700: --- Component/s: (was: Spark Core) > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.4.1 >Reporter: jiahong.li >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
jiahong.li created SPARK-44700: -- Summary: Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace) Key: SPARK-44700 URL: https://issues.apache.org/jira/browse/SPARK-44700 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1, 3.4.0 Reporter: jiahong.li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44698) Create table like other table should also copy table stats.
[ https://issues.apache.org/jira/browse/SPARK-44698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Zhu resolved SPARK-44698. Resolution: Not A Problem Sorry, I misunderstood: create table like doesn't need to copy data, actually! > Create table like other table should also copy table stats. > --- > > Key: SPARK-44698 > URL: https://issues.apache.org/jira/browse/SPARK-44698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 4.0.0 >Reporter: Qi Zhu >Priority: Major > > For example: > describe table extended tbl; > col0 int > col1 int > col2 int > col3 int > Detailed Table Information > Catalog spark_catalog > Database default > Table tbl > Owner zhuqi > Created Time Mon Aug 07 14:02:30 CST 2023 > Last Access UNKNOWN > Created By Spark 4.0.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1691388473] > Statistics 30 bytes > Location > [file:/Users/zhuqi/spark/spark/spark-warehouse/tbl|file:///Users/zhuqi/spark/spark/spark-warehouse/tbl] > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog > Time taken: 0.032 seconds, Fetched 23 row(s) > create table tbl2 like tbl; > 23/08/07 14:14:07 WARN HiveMetaStore: Location: > [file:/Users/zhuqi/spark/spark/spark-warehouse/tbl2|file:///Users/zhuqi/spark/spark/spark-warehouse/tbl2] > specified for non-external table:tbl2 > Time taken: 0.098 seconds > spark-sql (default)> describe table extended tbl2; > col0 int > col1 int > col2 int > col3 int > Detailed Table Information > Catalog spark_catalog > Database default > Table tbl2 > Owner zhuqi > Created Time Mon Aug 07 14:14:07 CST 2023 > Last Access UNKNOWN > Created By Spark 4.0.0-SNAPSHOT > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1691388847] > 
Location > [file:/Users/zhuqi/spark/spark/spark-warehouse/tbl2|file:///Users/zhuqi/spark/spark/spark-warehouse/tbl2] > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog > Time taken: 0.03 seconds, Fetched 22 row(s) > The table stats are missing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org