[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751899#comment-17751899
 ] 

Hyukjin Kwon commented on SPARK-44717:
--

I think we should at least make this respect {{spark.sql.timestampType}} when 
it's set to {{TIMESTAMP_NTZ}}. Took a quick look, and I think there is a 
problem in the calculation logic in 
https://github.com/apache/spark/blob/master/python/pyspark/pandas/resample.py#L277-L310
 especially that date_trunc always returns {{TimestampType}}. We might need a 
dedicated internal expression. cc [~podongfeng]
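
For reference, a quick way to observe the return type in question (a sketch only; it 
assumes an active SparkSession bound to {{spark}}):
{code:python}
# Sketch: check what type date_trunc yields when the session default is TIMESTAMP_NTZ.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
df = spark.sql("SELECT date_trunc('DAY', TIMESTAMP_NTZ '2011-03-13 01:00:00') AS t")
df.printSchema()  # the behavior described above is that `t` still comes back as TimestampType
{code}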

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ, for example to New York, by using the following 
> Python code in a "setUpClass":
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> You will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the PySpark resample produces more resampled rows in 
> the result. The DST change causes those extra rows, because the computed 
> __tmp_resample_bin_col__ looks like this:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around the DST transition on 2011-03-13 in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-less (naive):
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >>> a
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +-------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +-------------------------------+
> |            2011-03-13 01:00:00|
> +-------------------------------+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
> +--------------------------------------------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> 

[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-07 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44717:
---
Description: 
Use one of the existing tests:
- "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 
- "1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 

After setting the TZ, for example to New York, by using the following 
Python code in a "setUpClass":
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}
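
For illustration only, a hypothetical fuller fixture could look like the sketch below; 
{{time.tzset()}} is typically needed on Unix for the TZ change to take effect in the 
Python process, and the JVM session time zone may also need to match:
{code:python}
# Hypothetical setUpClass sketch -- class and method names are illustrative,
# not taken from the Spark test suite.
import os
import time
import unittest


class ResampleDstReproTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        os.environ["TZ"] = "America/New_York"
        time.tzset()  # Unix only: apply the TZ change to this Python process
{code}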

You will get the following error for the latter test:

{noformat}
==
FAIL [4.219s]: test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
276, in test_series_resample
self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", 
"sum")
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
259, in _test_resample
self.assert_eq(
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
assert_eq
_assert_pandas_almost_equal(lobj, robj)
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
_assert_pandas_almost_equal
raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] 
Series are not almost equal:
Left:
Freq: 1001H
float64
Right:
float64
{noformat}

The problem is that the PySpark resample produces more resampled rows in 
the result. The DST change causes those extra rows, because the computed 
__tmp_resample_bin_col__ looks like this:

{noformat}
| __index_level_0__  | __tmp_resample_bin_col__ | A
.
|2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
|2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
|2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
|2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
|2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
|2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
|2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
|2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
|2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
|2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
|2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
{noformat}

You can see the extra rows around the DST transition on 2011-03-13 in New 
York.

Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.

You can see my tests here:
https://github.com/attilapiros/spark/pull/5

Pandas timestamps are TZ-less (naive):
{noformat}
import pandas as pd
a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
b = pd.Timedelta(hours=1)

>>> a
Timestamp('2011-03-13 01:00:00')
>>> a+b
Timestamp('2011-03-13 02:00:00')
>>> a+b+b
Timestamp('2011-03-13 03:00:00')
{noformat}

But pyspark TimestampType uses TZ and DST:

{noformat}
>>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
+-------------------------------+
|TIMESTAMP '2011-03-13 01:00:00'|
+-------------------------------+
|            2011-03-13 01:00:00|
+-------------------------------+

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
+--------------------------------------------------------------------+
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
+--------------------------------------------------------------------+
|                                                 2011-03-13 03:00:00|
+--------------------------------------------------------------------+
{noformat}

The current resample code uses the above interval-based calculation.
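
A minimal pandas-only sketch of the same divergence (the tz-aware timestamp below 
stands in for Spark's session-time-zone TimestampType; this is an illustration, not 
the actual resample code path):
{code:python}
import pandas as pd

# Naive timestamp, as pandas uses for resampling:
naive = pd.Timestamp("2011-03-13 01:00:00")
# Zone-aware timestamp, analogous to Spark's session-TZ TimestampType:
aware = pd.Timestamp("2011-03-13 01:00:00", tz="America/New_York")

print(naive + pd.Timedelta(hours=1))  # 2011-03-13 02:00:00 (no DST adjustment)
print(aware + pd.Timedelta(hours=1))  # 2011-03-13 03:00:00-04:00 (jumps the DST gap)
{code}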


  was:
Use one of the existing test:
- "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 
- "1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 

After setting the TZ for example to New York. Like by using the following 
python code in a "setUpClass":  
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}

You will get the error for the latter one:

{noformat}
==
FAIL [4.219s]: test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
276, in test_series_resample
self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", 
"sum")
  File 

[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751887#comment-17751887
 ] 

Hyukjin Kwon commented on SPARK-44717:
--

cc [~podongfeng]

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ, for example to New York, by using the following 
> Python code in a "setUpClass":
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> You will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the PySpark resample produces more resampled rows 
> in the result. The DST change causes those extra rows, because the computed 
> __tmp_resample_bin_col__ looks like this:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around the DST transition on 2011-03-13 in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-less (naive):
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >>> a
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +---+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +---+
> |2011-03-13 01:00:00|
> +---+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
> ++
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> ++
> | 2011-03-13 03:00:00|
> ++
> {noformat}
> The current resample code uses the above interval based calculation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help

2023-08-07 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44717:
---
Summary: "pyspark.pandas.resample" is incorrect when DST is overlapped and 
setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not help  (was: 
"pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
"spark.sql.timestampType" to TIMESTAMP_NTZ does not applied)

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> --
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ, for example to New York, by using the following 
> Python code in a "setUpClass":
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> You will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the PySpark resample produces more resampled rows 
> in the result. The DST change causes those extra rows, because the computed 
> __tmp_resample_bin_col__ looks like this:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around the DST transition on 2011-03-13 in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-less (naive):
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >>> a
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +---+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +---+
> |2011-03-13 01:00:00|
> +---+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
> ++
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> ++
> | 2011-03-13 03:00:00|
> 

[jira] [Commented] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied

2023-08-07 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751886#comment-17751886
 ] 

Attila Zsolt Piros commented on SPARK-44717:


cc [~itholic] [~gurwls223]


> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied
> -
>
> Key: SPARK-44717
> URL: https://issues.apache.org/jira/browse/SPARK-44717
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0, 3.4.1, 4.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Use one of the existing tests:
> - "11H" case of test_dataframe_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> - "1001H" case of test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests) 
> After setting the TZ, for example to New York, by using the following 
> Python code in a "setUpClass":
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> You will get the following error for the latter test:
> {noformat}
> ==
> FAIL [4.219s]: test_series_resample 
> (pyspark.pandas.tests.test_resample.ResampleTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 276, in test_series_resample
> self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", 
> "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
> 259, in _test_resample
> self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
> assert_eq
> _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
> _assert_pandas_almost_equal
> raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: 
> [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the PySpark resample produces more resampled rows 
> in the result. The DST change causes those extra rows, because the computed 
> __tmp_resample_bin_col__ looks like this:
> {noformat}
> | __index_level_0__.| __tmp_resample_bin_col__ | A
> .
> |2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
> {noformat}
> You can see the extra rows around the DST transition on 2011-03-13 in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are TZ-less (naive):
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >>> a
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses TZ and DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +---+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +---+
> |2011-03-13 01:00:00|
> +---+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
> ++
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> ++
> | 2011-03-13 03:00:00|
> ++
> {noformat}
> The current resample code uses the above interval based calculation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied

2023-08-07 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44717:
---
Description: 
Use one of the existing tests:
- "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 
- "1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 

After setting the TZ, for example to New York, by using the following 
Python code in a "setUpClass":
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}

You will get the following error for the latter test:

{noformat}
==
FAIL [4.219s]: test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
276, in test_series_resample
self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", 
"sum")
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
259, in _test_resample
self.assert_eq(
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
assert_eq
_assert_pandas_almost_equal(lobj, robj)
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
_assert_pandas_almost_equal
raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] 
Series are not almost equal:
Left:
Freq: 1001H
float64
Right:
float64
{noformat}

The problem is that the PySpark resample produces more resampled rows in 
the result. The DST change causes those extra rows, because the computed 
__tmp_resample_bin_col__ looks like this:

{noformat}
| __index_level_0__  | __tmp_resample_bin_col__ | A
.
|2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
|2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
|2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
|2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
|2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
|2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
|2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
|2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
|2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
|2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
|2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
{noformat}

You can see the extra rows around the DST transition on 2011-03-13 in New 
York.

Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.

You can see my tests here:
https://github.com/attilapiros/spark/pull/5

Pandas timestamps are TZ-less (naive):
{noformat}
import pandas as pd
a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
b = pd.Timedelta(hours=1)

>>> a
Timestamp('2011-03-13 01:00:00')
>>> a+b
Timestamp('2011-03-13 02:00:00')
>>> a+b+b
Timestamp('2011-03-13 03:00:00')
{noformat}

But pyspark TimestampType uses TZ and DST:

{noformat}
>>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
+---+
|TIMESTAMP '2011-03-13 01:00:00'|
+---+
|2011-03-13 01:00:00|
+---+

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
++
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 03:00:00|
++
{noformat}

The current resample code uses the above interval based calculation.


  was:
Use one of the existing test:
- "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 
- "1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 

After setting the TZ for example to New York. Like by using the following 
python code in a "setUpClass":  
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}

You will get the error for the latter one:

{noformat}
==
FAIL [4.219s]: test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
276, in test_series_resample
self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", 
"sum")
  File 

[jira] [Updated] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied

2023-08-07 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-44717:
---
Description: 
Use one of the existing tests:
- "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 
- "1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 

After setting the TZ, for example to New York, by using the following 
Python code in a "setUpClass":
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}

You will get the following error for the latter test:

{noformat}
==
FAIL [4.219s]: test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
276, in test_series_resample
self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", 
"sum")
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
259, in _test_resample
self.assert_eq(
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
assert_eq
_assert_pandas_almost_equal(lobj, robj)
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
_assert_pandas_almost_equal
raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] 
Series are not almost equal:
Left:
Freq: 1001H
float64
Right:
float64
{noformat}

The problem is that the PySpark resample produces more resampled rows in 
the result. The DST change causes those extra rows, because the computed 
__tmp_resample_bin_col__ looks like this:

{noformat}
| __index_level_0__.| __tmp_resample_bin_col__ | A
.
|2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
|2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
|2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
|2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
|2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
|2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
|2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
|2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
|2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
|2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
|2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
{noformat}

You can see the extra rows around the DST transition on 2011-03-13 in New 
York.

Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.

You can see my tests here:
https://github.com/attilapiros/spark/pull/5

Pandas timestamps are TZ-less (naive):
{noformat}
import pandas as pd
a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
b = pd.Timedelta(hours=1)

>>> a
Timestamp('2011-03-13 01:00:00')
>>> a+b
Timestamp('2011-03-13 02:00:00')
>>> a+b+b
Timestamp('2011-03-13 03:00:00')
{noformat}

But pyspark TimestampType uses TZ and DST:

{noformat}
>>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
+---+
|TIMESTAMP '2011-03-13 01:00:00'|
+---+
|2011-03-13 01:00:00|
+---+

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
++
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 03:00:00|
++
{noformat}

The current resample code uses the above interval based calculation.


  was:
Use one of the existing test:
- "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 
-"1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 

After setting the TZ for example to New York. Like by using the following 
python code in a "setUpClass":  
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}

You will get the error for the latter one:

{noformat}
==
FAIL [4.219s]: test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
276, in test_series_resample
self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", 
"sum")
  File 

[jira] [Created] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied

2023-08-07 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-44717:
--

 Summary: "pyspark.pandas.resample" is incorrect when DST is 
overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not 
applied
 Key: SPARK-44717
 URL: https://issues.apache.org/jira/browse/SPARK-44717
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.1, 3.4.0, 4.0.0
Reporter: Attila Zsolt Piros


Use one of the existing tests:
- "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 
-"1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests) 

After setting the TZ, for example to New York, by using the following 
Python code in a "setUpClass":
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}

You will get the following error for the latter test:

{noformat}
==
FAIL [4.219s]: test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
276, in test_series_resample
self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", 
"sum")
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 
259, in _test_resample
self.assert_eq(
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in 
assert_eq
_assert_pandas_almost_equal(lobj, robj)
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in 
_assert_pandas_almost_equal
raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] 
Series are not almost equal:
Left:
Freq: 1001H
float64
Right:
float64
{noformat}

The problem is that the PySpark resample produces more resampled rows in 
the result. The DST change causes those extra rows, because the computed 
__tmp_resample_bin_col__ looks like this:

{noformat}
| __index_level_0__.| __tmp_resample_bin_col__ | A
.
|2011-03-08 00:00:00|2011-03-26 11:00:00 |0.3980551570183919  |
|2011-03-09 00:00:00|2011-03-26 11:00:00 |0.6511376673995046  |
|2011-03-10 00:00:00|2011-03-26 11:00:00 |0.6141085426890365  |
|2011-03-11 00:00:00|2011-03-26 11:00:00 |0.11557638066163867 |
|2011-03-12 00:00:00|2011-03-26 11:00:00 |0.4517788243490799  |
|2011-03-13 00:00:00|2011-03-26 11:00:00 |0.8637060550157284  |
|2011-03-14 00:00:00|2011-03-26 10:00:00 |0.8169499149450166  |
|2011-03-15 00:00:00|2011-03-26 10:00:00 |0.4585916249356583  |
|2011-03-16 00:00:00|2011-03-26 10:00:00 |0.8362472880832088  |
|2011-03-17 00:00:00|2011-03-26 10:00:00 |0.026716901748386812|
|2011-03-18 00:00:00|2011-03-26 10:00:00 |0.9086816462089563  |
{noformat}

You can see the extra rows around the DST transition on 2011-03-13 in New 
York.

Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.

You can see my tests here:
https://github.com/attilapiros/spark/pull/5

Pandas timestamps are TZ-less (naive):
{noformat}
import pandas as pd
a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
b = pd.Timedelta(hours=1)

>>> a
Timestamp('2011-03-13 01:00:00')
>>> a+b
Timestamp('2011-03-13 02:00:00')
>>> a+b+b
Timestamp('2011-03-13 03:00:00')
{noformat}

But pyspark TimestampType uses TZ and DST:

{noformat}
>>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
+---+
|TIMESTAMP '2011-03-13 01:00:00'|
+---+
|2011-03-13 01:00:00|
+---+

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
++
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
++
| 2011-03-13 03:00:00|
++
{noformat}

The current resample code uses the above interval based calculation.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-44716) Add missing callUDF in Python Spark Connect client

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon deleted SPARK-44716:
-


> Add missing callUDF in Python Spark Connect client
> --
>
> Key: SPARK-44716
> URL: https://issues.apache.org/jira/browse/SPARK-44716
> Project: Spark
>  Issue Type: Task
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See also SPARK-44715



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44716) Add missing callUDF in Python Spark Connect client

2023-08-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44716:


 Summary: Add missing callUDF in Python Spark Connect client
 Key: SPARK-44716
 URL: https://issues.apache.org/jira/browse/SPARK-44716
 Project: Spark
  Issue Type: Task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


See also SPARK-44715



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44703) Log eventLog rewrite duration when compact old event log files

2023-08-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-44703.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42378
[https://github.com/apache/spark/pull/42378]

> Log eventLog rewrite duration when compact old event log files
> --
>
> Key: SPARK-44703
> URL: https://issues.apache.org/jira/browse/SPARK-44703
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Major
> Fix For: 4.0.0
>
>
> When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog 
> files exceeds the value of 
> {{spark.history.fs.eventLog.rolling.maxFilesToRetain}},
> the HistoryServer compacts the old event log files into one compact file.
> Currently the rewrite duration is not logged in the {{rewrite}} method; this 
> metric is useful for understanding the compaction duration, so we need to add 
> logging to the method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44703) Log eventLog rewrite duration when compact old event log files

2023-08-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-44703:


Assignee: shuyouZZ

> Log eventLog rewrite duration when compact old event log files
> --
>
> Key: SPARK-44703
> URL: https://issues.apache.org/jira/browse/SPARK-44703
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Major
>
> When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog 
> files exceeds the value of 
> {{spark.history.fs.eventLog.rolling.maxFilesToRetain}},
> the HistoryServer compacts the old event log files into one compact file.
> Currently the rewrite duration is not logged in the {{rewrite}} method; this 
> metric is useful for understanding the compaction duration, so we need to add 
> logging to the method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44703) Log eventLog rewrite duration when compact old event log files

2023-08-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-44703:
-
Fix Version/s: (was: 3.5.0)

> Log eventLog rewrite duration when compact old event log files
> --
>
> Key: SPARK-44703
> URL: https://issues.apache.org/jira/browse/SPARK-44703
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: shuyouZZ
>Priority: Major
>
> When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog 
> files exceeds the value of 
> {{spark.history.fs.eventLog.rolling.maxFilesToRetain}},
> the HistoryServer compacts the old event log files into one compact file.
> Currently the rewrite duration is not logged in the {{rewrite}} method; this 
> metric is useful for understanding the compaction duration, so we need to add 
> logging to the method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44005) Improve error messages for regular Python UDTFs that return non-tuple values

2023-08-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44005.
---
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42353
[https://github.com/apache/spark/pull/42353]

> Improve error messages for regular Python UDTFs that return non-tuple values
> 
>
> Key: SPARK-44005
> URL: https://issues.apache.org/jira/browse/SPARK-44005
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Currently, if you have a UDTF like this:
> {code:java}
> class TestUDTF:
> def eval(self, a: int):
> yield a {code}
> and run the UDTF, it will fail with a confusing error message like
> {code:java}
> Unexpected tuple 1 with StructType {code}
> Note that this works when Arrow is enabled. We should improve the error 
> messages for regular UDTFs.
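> For context, a hedged sketch of how such a UDTF is typically defined and invoked 
> with the Python UDTF API (the decorator call and {{lit}} usage below are 
> illustrative assumptions, not taken from this ticket; an active SparkSession is 
> assumed):
> {code:python}
> from pyspark.sql.functions import lit, udtf
>
> @udtf(returnType="a: int")
> class TestUDTF:
>     def eval(self, a: int):
>         yield a  # a bare value instead of a tuple such as (a,) triggers the error above
>
> TestUDTF(lit(1)).show()
> {code}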



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44005) Improve error messages for regular Python UDTFs that return non-tuple values

2023-08-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44005:
-

Assignee: Allison Wang

> Improve error messages for regular Python UDTFs that return non-tuple values
> 
>
> Key: SPARK-44005
> URL: https://issues.apache.org/jira/browse/SPARK-44005
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Currently, if you have a UDTF like this:
> {code:java}
> class TestUDTF:
> def eval(self, a: int):
> yield a {code}
> and run the UDTF, it will fail with a confusing error message like
> {code:java}
> Unexpected tuple 1 with StructType {code}
> Note that this works when Arrow is enabled. We should improve the error 
> messages for regular UDTFs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44689) `UDFClassLoadingE2ESuite` failed in the daily test for Java 17.

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44689.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42360
[https://github.com/apache/spark/pull/42360]

> `UDFClassLoadingE2ESuite` failed in the daily test for Java 17.
> ---
>
> Key: SPARK-44689
> URL: https://issues.apache.org/jira/browse/SPARK-44689
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> [https://github.com/apache/spark/actions/runs/5766913899/job/15635782831]
>  
> {code:java}
> [info] - update class loader after stubbing: new session *** FAILED *** (101 
> milliseconds)
> 6233[info]   "Exception in SerializedLambda.readResolve" did not contain 
> "java.lang.NoSuchMethodException: 
> org.apache.spark.sql.connect.client.StubClassDummyUdf" 
> (UDFClassLoadingE2ESuite.scala:57)
> 6234[info]   org.scalatest.exceptions.TestFailedException:
> 6235[info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> 6236[info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> 6237[info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> 6238[info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> 6239[info]   at 
> org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.$anonfun$new$1(UDFClassLoadingE2ESuite.scala:57)
> 6240[info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 6241[info]   at 
> org.apache.spark.sql.connect.client.util.RemoteSparkSession.$anonfun$test$1(RemoteSparkSession.scala:243)
> 6242[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 6243[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 6244[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 6245[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> 6246[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> 6247[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> 6248[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> 6249[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> 6250[info]   at 
> org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
> 6251[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> 6252[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> 6253[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> 6254[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> 6255[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> 6256[info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
> 6257[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> 6258[info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> 6259[info]   at scala.collection.immutable.List.foreach(List.scala:431)
> 6260[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> 6261[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> 6262[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> 6263[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> 6264[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> 6265[info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> 6266[info]   at org.scalatest.Suite.run(Suite.scala:1114)
> 6267[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
> 6268[info]   at 
> org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
> 6269[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> 6270[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> 6271[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> 6272[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> 6273[info]   at 
> org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.org$scalatest$BeforeAndAfterAll$$super$run(UDFClassLoadingE2ESuite.scala:29)
> 6274[info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> 6275[info]   at 
> 

[jira] [Assigned] (SPARK-44689) `UDFClassLoadingE2ESuite` failed in the daily test for Java 17.

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44689:


Assignee: Yang Jie

> `UDFClassLoadingE2ESuite` failed in the daily test for Java 17.
> ---
>
> Key: SPARK-44689
> URL: https://issues.apache.org/jira/browse/SPARK-44689
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> [https://github.com/apache/spark/actions/runs/5766913899/job/15635782831]
>  
> {code:java}
> [info] - update class loader after stubbing: new session *** FAILED *** (101 
> milliseconds)
> 6233[info]   "Exception in SerializedLambda.readResolve" did not contain 
> "java.lang.NoSuchMethodException: 
> org.apache.spark.sql.connect.client.StubClassDummyUdf" 
> (UDFClassLoadingE2ESuite.scala:57)
> 6234[info]   org.scalatest.exceptions.TestFailedException:
> 6235[info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> 6236[info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> 6237[info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> 6238[info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> 6239[info]   at 
> org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.$anonfun$new$1(UDFClassLoadingE2ESuite.scala:57)
> 6240[info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 6241[info]   at 
> org.apache.spark.sql.connect.client.util.RemoteSparkSession.$anonfun$test$1(RemoteSparkSession.scala:243)
> 6242[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> 6243[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> 6244[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 6245[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> 6246[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> 6247[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> 6248[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> 6249[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> 6250[info]   at 
> org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
> 6251[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> 6252[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> 6253[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> 6254[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> 6255[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> 6256[info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
> 6257[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> 6258[info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> 6259[info]   at scala.collection.immutable.List.foreach(List.scala:431)
> 6260[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> 6261[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> 6262[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> 6263[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> 6264[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> 6265[info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
> 6266[info]   at org.scalatest.Suite.run(Suite.scala:1114)
> 6267[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
> 6268[info]   at 
> org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
> 6269[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> 6270[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> 6271[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> 6272[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> 6273[info]   at 
> org.apache.spark.sql.connect.client.UDFClassLoadingE2ESuite.org$scalatest$BeforeAndAfterAll$$super$run(UDFClassLoadingE2ESuite.scala:29)
> 6274[info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> 6275[info]   at 
> org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> 6276[info]   at 
> org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> 6277[info]   

[jira] [Resolved] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44554.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42167
[https://github.com/apache/spark/pull/42167]

> Install different Python linter dependencies for daily testing of different 
> Spark versions
> --
>
> Key: SPARK-44554
> URL: https://issues.apache.org/jira/browse/SPARK-44554
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 4.0.0
>
>
> Fix the daily test Python lint check failures for branches 3.3 and 3.4
>  
> 3.4 : 
> [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266]
> 3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44554) Install different Python linter dependencies for daily testing of different Spark versions

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44554:


Assignee: Yang Jie

> Install different Python linter dependencies for daily testing of different 
> Spark versions
> --
>
> Key: SPARK-44554
> URL: https://issues.apache.org/jira/browse/SPARK-44554
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> Fix the daily test Python lint check failures for branches 3.3 and 3.4
>  
> 3.4 : 
> [https://github.com/apache/spark/actions/runs/5654787844/job/15318633266]
> 3.3 : https://github.com/apache/spark/actions/runs/5653655970/job/15315236052



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44641) SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but conditions unmet

2023-08-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-44641.
--
Fix Version/s: 3.4.2
   3.5.0
 Assignee: Chao Sun
   Resolution: Fixed

> SPJ: Results duplicated when SPJ partial-cluster and pushdown enabled but 
> conditions unmet
> --
>
> Key: SPARK-44641
> URL: https://issues.apache.org/jira/browse/SPARK-44641
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Szehon Ho
>Assignee: Chao Sun
>Priority: Blocker
> Fix For: 3.4.2, 3.5.0
>
>
> Adding the following test case in KeyGroupedPartitionSuite demonstrates the 
> problem.
>  
> {code:java}
> test("test join key is the second partition key and a transform") {
>   val items_partitions = Array(bucket(8, "id"), days("arrive_time"))
>   createTable(items, items_schema, items_partitions)
>   sql(s"INSERT INTO testcat.ns.$items VALUES " +
> s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
> s"(1, 'aa', 41.0, cast('2020-01-15' as timestamp)), " +
> s"(2, 'bb', 10.0, cast('2020-01-01' as timestamp)), " +
> s"(2, 'bb', 10.5, cast('2020-01-01' as timestamp)), " +
> s"(3, 'cc', 15.5, cast('2020-02-01' as timestamp))")
>   val purchases_partitions = Array(bucket(8, "item_id"), days("time"))
>   createTable(purchases, purchases_schema, purchases_partitions)
>   sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
> s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
> s"(1, 44.0, cast('2020-01-15' as timestamp)), " +
> s"(1, 45.0, cast('2020-01-15' as timestamp)), " +
> s"(2, 11.0, cast('2020-01-01' as timestamp)), " +
> s"(3, 19.5, cast('2020-02-01' as timestamp))")
>   withSQLConf(
> SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_CO_PARTITION.key -> "false",
> SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
> SQLConf.V2_BUCKETING_PARTIALLY_CLUSTERED_DISTRIBUTION_ENABLED.key ->
>   "true") {
> val df = sql("SELECT id, name, i.price as purchase_price, " +
>   "p.item_id, p.price as sale_price " +
>   s"FROM testcat.ns.$items i JOIN testcat.ns.$purchases p " +
>   "ON i.arrive_time = p.time " +
>   "ORDER BY id, purchase_price, p.item_id, sale_price")
> val shuffles = collectShuffles(df.queryExecution.executedPlan)
> assert(!shuffles.isEmpty, "should not perform SPJ as not all join keys 
> are partition keys")
> checkAnswer(df,
>   Seq(
> Row(1, "aa", 40.0, 1, 42.0),
> Row(1, "aa", 40.0, 2, 11.0),
> Row(1, "aa", 41.0, 1, 44.0),
> Row(1, "aa", 41.0, 1, 45.0),
> Row(2, "bb", 10.0, 1, 42.0),
> Row(2, "bb", 10.0, 2, 11.0),
> Row(2, "bb", 10.5, 1, 42.0),
> Row(2, "bb", 10.5, 2, 11.0),
> Row(3, "cc", 15.5, 3, 19.5)
>   )
> )
>   }
> }{code}
>  
> Note: this test has set up the DataSource V2 to return multiple splits for the 
> same partition.
> In this case, SPJ is not triggered (because the join key does not match the 
> partition key), but the following code in the DSv2 scan:
> [https://github.com/apache/spark/blob/v3.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L194]
> intended to fill the empty partitions for 'pushdown-value', will still iterate 
> through the non-grouped partitions and look up the grouped partitions to fill 
> the map, resulting in some duplicate input data being fed into the join.
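
A simplified, hypothetical illustration of the duplicate-fill shape described above; the names (Split, grouped, buggy) are made up for the sketch and do not exist in Spark:
{code:scala}
// Hypothetical model of the duplicate-fill problem; illustration only.
case class Split(partValue: Int, data: String)

object DuplicateFillSketch {
  def main(args: Array[String]): Unit = {
    // Two splits share partition value 1 (the "multiple splits per partition" setup).
    val inputSplits = Seq(Split(1, "a"), Split(1, "b"), Split(2, "c"))

    // Splits grouped by partition value, as the partially-clustered path keeps them.
    val grouped: Map[Int, Seq[Split]] = inputSplits.groupBy(_.partValue)

    // Buggy fill: iterate the NON-grouped splits and, for each of them, look up and
    // append the whole grouped bucket. The bucket for value 1 is appended twice,
    // so splits "a" and "b" each reach the join two times.
    val buggy = inputSplits.flatMap(s => grouped(s.partValue))
    println(buggy.size)   // 5 instead of 3

    // Filling from the distinct partition values does not duplicate anything.
    val fixed = grouped.values.flatten.toSeq
    println(fixed.size)   // 3
  }
}
{code}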



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly

2023-08-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-44683:
-
Fix Version/s: 3.5.1
   (was: 3.5.0)

> Logging level isn't passed to RocksDB state store provider correctly
> 
>
> Key: SPARK-44683
> URL: https://issues.apache.org/jira/browse/SPARK-44683
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Minor
> Fix For: 4.0.0, 3.5.1
>
>
> We pass log4j's log level to RocksDB so that RocksDB debug logs can go to 
> log4j. However, we pass the log level in after we create the logger, and the 
> way it is set isn't effective. This has two impacts: (1) setting the DEBUG 
> level doesn't make RocksDB generate DEBUG-level logs; (2) setting the WARN or 
> ERROR level does prevent INFO-level logging, but RocksDB still makes JNI calls 
> to Scala, which is unnecessary overhead.
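
A purely illustrative sketch of the ordering problem, assuming nothing about the real RocksDB JNI API: DbOptions and DbLogger below are made-up stand-ins, and the point is only that a level applied after the logger is constructed never takes effect.
{code:scala}
// Made-up types; not the RocksDB API or the actual state store provider code.
case class DbOptions(var logLevel: String = "INFO")

// The logger captures the level once, at construction time, which is effectively
// why setting the level afterwards has no effect.
class DbLogger(options: DbOptions) {
  private val capturedLevel = options.logLevel
  def log(msg: String): Unit = println(s"[$capturedLevel] $msg")
}

object LogLevelOrdering {
  def main(args: Array[String]): Unit = {
    // Broken order: create the logger first, then set the level.
    val late = DbOptions()
    val lateLogger = new DbLogger(late)
    late.logLevel = "DEBUG"            // too late, the logger already captured INFO
    lateLogger.log("still tagged INFO")

    // Intended order: set the level on the options before the logger is created.
    val early = DbOptions(logLevel = "DEBUG")
    new DbLogger(early).log("tagged DEBUG as expected")
  }
}
{code}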



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly

2023-08-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-44683:


Assignee: Siying Dong

> Logging level isn't passed to RocksDB state store provider correctly
> 
>
> Key: SPARK-44683
> URL: https://issues.apache.org/jira/browse/SPARK-44683
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Minor
>
> We pass log4j's log level to RocksDB so that RocksDB debug logs can go to 
> log4j. However, we pass the log level in after we create the logger, and the 
> way it is set isn't effective. This has two impacts: (1) setting the DEBUG 
> level doesn't make RocksDB generate DEBUG-level logs; (2) setting the WARN or 
> ERROR level does prevent INFO-level logging, but RocksDB still makes JNI calls 
> to Scala, which is unnecessary overhead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44683) Logging level isn't passed to RocksDB state store provider correctly

2023-08-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-44683.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42354
[https://github.com/apache/spark/pull/42354]

> Logging level isn't passed to RocksDB state store provider correctly
> 
>
> Key: SPARK-44683
> URL: https://issues.apache.org/jira/browse/SPARK-44683
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
>
> We pass log4j's log level to RocksDB so that RocksDB debug logs can go to 
> log4j. However, we pass the log level in after we create the logger, and the 
> way it is set isn't effective. This has two impacts: (1) setting the DEBUG 
> level doesn't make RocksDB generate DEBUG-level logs; (2) setting the WARN or 
> ERROR level does prevent INFO-level logging, but RocksDB still makes JNI calls 
> to Scala, which is unnecessary overhead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44694) Add Default & Active SparkSession for Python Client

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44694.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42371
[https://github.com/apache/spark/pull/42371]

> Add Default & Active SparkSession for Python Client
> ---
>
> Key: SPARK-44694
> URL: https://issues.apache.org/jira/browse/SPARK-44694
> Project: Spark
>  Issue Type: Task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> SPARK-43429 for Python side



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44694) Add Default & Active SparkSession for Python Client

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44694:


Assignee: Hyukjin Kwon

> Add Default & Active SparkSession for Python Client
> ---
>
> Key: SPARK-44694
> URL: https://issues.apache.org/jira/browse/SPARK-44694
> Project: Spark
>  Issue Type: Task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-43429 for Python side



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44715) Add missing udf and callUdf functions

2023-08-07 Thread Jira
Herman van Hövell created SPARK-44715:
-

 Summary: Add missing udf and callUdf functions
 Key: SPARK-44715
 URL: https://issues.apache.org/jira/browse/SPARK-44715
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44705) Make PythonRunner single-threaded

2023-08-07 Thread Utkarsh Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751843#comment-17751843
 ] 

Utkarsh Agarwal commented on SPARK-44705:
-

https://github.com/apache/spark/pull/42385 for this issue

> Make PythonRunner single-threaded
> -
>
> Key: SPARK-44705
> URL: https://issues.apache.org/jira/browse/SPARK-44705
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Utkarsh Agarwal
>Priority: Major
>
> PythonRunner, a utility that executes Python UDFs in Spark, uses two threads 
> in a producer-consumer model today. This multi-threading model is problematic 
> and confusing, as Spark's execution model within a task is commonly understood 
> to be single-threaded. 
> More importantly, this departure into double-threaded execution has resulted in 
> a series of customer issues involving [race 
> conditions|https://issues.apache.org/jira/browse/SPARK-33277] and 
> [deadlocks|https://issues.apache.org/jira/browse/SPARK-38677] between threads, 
> as the code was hard to reason about. There have been multiple attempts to 
> rein in these issues, viz., [fix 
> 1|https://issues.apache.org/jira/browse/SPARK-22535], [fix 
> 2|https://github.com/apache/spark/pull/30177], [fix 
> 3|https://github.com/apache/spark/commit/243c321db2f02f6b4d926114bd37a6e74c2be185].
>  Moreover, the fixes have made the code base somewhat abstruse by introducing 
> multiple daemon [monitor 
> threads|https://github.com/apache/spark/blob/a3a32912be04d3760cb34eb4b79d6d481bbec502/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L579]
>  to detect deadlocks.
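
For context, a generic two-thread producer-consumer sketch in plain Scala (not PythonRunner itself): every put/take hand-off between the threads is a point where a race or deadlock can hide if either side also waits on something else, which is the kind of coordination a single-threaded runner avoids.
{code:scala}
import java.util.concurrent.ArrayBlockingQueue

// Generic producer-consumer sketch; illustration only.
object ProducerConsumerSketch {
  def main(args: Array[String]): Unit = {
    val queue = new ArrayBlockingQueue[Int](2)   // small bounded buffer between the threads

    // Writer thread: feeds inputs towards the consumer.
    val writer = new Thread(() => {
      (1 to 5).foreach(i => queue.put(i))        // blocks while the buffer is full
      queue.put(-1)                              // poison pill signalling completion
    })
    writer.start()

    // Reader side (here the main thread) consumes until the poison pill arrives.
    var done = false
    while (!done) {
      val v = queue.take()                       // blocks until the writer hands something over
      if (v == -1) done = true else println(s"consumed $v")
    }
    writer.join()
  }
}
{code}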



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44714) Ease restriction of LCA resolution regarding queries with having

2023-08-07 Thread Xinyi Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyi Yu updated SPARK-44714:
-
Description: 
Current LCA resolution has a limitation: it can't resolve a query when the query 
satisfies all of the following criteria:
 # the main (outer) query has a HAVING clause
 # there is a window expression in the query
 # in the same SELECT list as the window expression in 2), there is an LCA

This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; 
window expressions won't be extracted until LCAs in the same SELECT lists are 
rewritten; however, UNRESOLVED_HAVING depends on the child being resolved, which 
could include the Window. It becomes a deadlock.

*We should ease some limitations on LCA resolution regarding HAVING, to 
break the deadlock for most cases.*

For example, for the following query:
{code:java}
create table t (col boolean) using orc;
with w AS (
  select min(col) over () as min_alias,
  min_alias as col_alias
  FROM t
)
select col_alias
from w
having count(*) > 0;
{code}

 
It now throws a confusing error message:
{code:java}
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`col_alias` 
cannot be resolved. Did you mean one of the following? [`col_alias`, 
`min_alias`].{code}

The LCA and the window are in a CTE that is completely unrelated to the HAVING. 
LCA should resolve in this case.

  was:
Current LCA resolution has a limitation: it can't resolve a query when the query 
satisfies all of the following criteria:
 # the main (outer) query has a HAVING clause
 # there is a window expression in the query
 # in the same SELECT list as the window expression in 2), there is an LCA

This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; 
window expressions won't be extracted until LCAs in the same SELECT lists are 
rewritten; however, UNRESOLVED_HAVING depends on the child being resolved, which 
could include the Window. It becomes a deadlock.

*We should ease some limitations on LCA resolution regarding HAVING, to 
break the deadlock for most cases.*

For example, for the following query:
create table t (col boolean) using orc;

with w AS (
  select min(col) over () as min_alias,
         min_alias as col_alias
  FROM t
)
select col_alias
from w
having count(*) > 0;

It now throws a confusing error message:
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`col_alias` 
cannot be resolved. Did you mean one of the following? [`col_alias`, 
`min_alias`].
The LCA and the window are in a CTE that is completely unrelated to the HAVING. 
LCA should resolve in this case.


> Ease restriction of LCA resolution regarding queries with having
> 
>
> Key: SPARK-44714
> URL: https://issues.apache.org/jira/browse/SPARK-44714
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Xinyi Yu
>Priority: Major
>
> Current LCA resolution has a limitation: it can't resolve a query 
> when the query satisfies all of the following criteria:
>  # the main (outer) query has a HAVING clause
>  # there is a window expression in the query
>  # in the same SELECT list as the window expression in 2), there is an LCA
> This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; 
> window expressions won't be extracted until LCAs in the same SELECT lists are 
> rewritten; however, UNRESOLVED_HAVING depends on the child being resolved, 
> which could include the Window. It becomes a deadlock.
> *We should ease some limitations on LCA resolution regarding HAVING, to 
> break the deadlock for most cases.*
> For example, for the following query:
> {code:java}
> create table t (col boolean) using orc;
> with w AS (
>   select min(col) over () as min_alias,
>   min_alias as col_alias
>   FROM t
> )
> select col_alias
> from w
> having count(*) > 0;
> {code}
>  
> It now throws a confusing error message:
> {code:java}
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
> `col_alias` 
> cannot be resolved. Did you mean one of the following? [`col_alias`, 
> `min_alias`].{code}
> The LCA and the window are in a CTE that is completely unrelated to the HAVING. 
> LCA should resolve in this case.
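
For reference, a minimal reproduction sketch assembled from the SQL above, written for a spark-shell session (it assumes the usual pre-defined `spark` SparkSession; it is not part of the ticket):
{code:scala}
spark.sql("create table t (col boolean) using orc")

// Analysis of this query currently fails with UNRESOLVED_COLUMN.WITH_SUGGESTION,
// even though the LCA and the window live in a CTE unrelated to the outer HAVING.
spark.sql("""
  with w as (
    select min(col) over () as min_alias,
           min_alias as col_alias
    from t
  )
  select col_alias
  from w
  having count(*) > 0
""").show()
{code}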



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44714) Ease restriction of LCA resolution regarding queries with having

2023-08-07 Thread Xinyi Yu (Jira)
Xinyi Yu created SPARK-44714:


 Summary: Ease restriction of LCA resolution regarding queries with 
having
 Key: SPARK-44714
 URL: https://issues.apache.org/jira/browse/SPARK-44714
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.2, 3.5.0
Reporter: Xinyi Yu


Current LCA resolution has a limitation: it can't resolve a query when the query 
satisfies all of the following criteria:
 # the main (outer) query has a HAVING clause
 # there is a window expression in the query
 # in the same SELECT list as the window expression in 2), there is an LCA

This is because LCA won't rewrite the plan until UNRESOLVED_HAVING is resolved; 
window expressions won't be extracted until LCAs in the same SELECT lists are 
rewritten; however, UNRESOLVED_HAVING depends on the child being resolved, which 
could include the Window. It becomes a deadlock.

*We should ease some limitations on LCA resolution regarding HAVING, to 
break the deadlock for most cases.*

For example, for the following query:
create table t (col boolean) using orc;

with w AS (
  select min(col) over () as min_alias,
         min_alias as col_alias
  FROM t
)
select col_alias
from w
having count(*) > 0;

It now throws a confusing error message:
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 
`col_alias` 
cannot be resolved. Did you mean one of the following? [`col_alias`, 
`min_alias`].
The LCA and the window are in a CTE that is completely unrelated to the HAVING. 
LCA should resolve in this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44713) Deduplicate files between sql/core and Spark Connect Scala Client

2023-08-07 Thread Jira
Herman van Hövell created SPARK-44713:
-

 Summary: Deduplicate files between sql/core and Spark Connect 
Scala Client
 Key: SPARK-44713
 URL: https://issues.apache.org/jira/browse/SPARK-44713
 Project: Spark
  Issue Type: New Feature
  Components: Connect, SQL
Affects Versions: 3.5.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44712) Migrate ‎test_timedelta_ops assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44712:
---
Description: Migrate assert_eq to assertDataFrameEqual in this file: 
[‎python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176]
  (was: Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46])

> Migrate ‎test_timedelta_ops assert_eq to use assertDataFrameEqual
> -
>
> Key: SPARK-44712
> URL: https://issues.apache.org/jira/browse/SPARK-44712
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> Migrate assert_eq to assertDataFrameEqual in this file: 
> [‎python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44712) Migrate ‎test_timedelta_ops assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44712:
--

 Summary: Migrate ‎test_timedelta_ops assert_eq to use 
assertDataFrameEqual
 Key: SPARK-44712
 URL: https://issues.apache.org/jira/browse/SPARK-44712
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44711:
---
Description: 
Migrate assert_eq to assertDataFrameEqual in this file: 
[‎python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63]

  was:Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]


> Migrate test_series_conversion assert_eq to use assertDataFrameEqual
> 
>
> Key: SPARK-44711
> URL: https://issues.apache.org/jira/browse/SPARK-44711
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> Migrate assert_eq to assertDataFrameEqual in this file: 
> [‎python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44711:
--

 Summary: Migrate test_series_conversion assert_eq to use 
assertDataFrameEqual
 Key: SPARK-44711
 URL: https://issues.apache.org/jira/browse/SPARK-44711
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44710) Support Dataset.dropDuplicatesWithinWatermark in Scala Client

2023-08-07 Thread Jira
Herman van Hövell created SPARK-44710:
-

 Summary: Support Dataset.dropDuplicatesWithinWatermark in Scala 
Client
 Key: SPARK-44710
 URL: https://issues.apache.org/jira/browse/SPARK-44710
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44692) Move Trigger.scala to sql/api

2023-08-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44692.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Move Trigger.scala to sql/api
> -
>
> Key: SPARK-44692
> URL: https://issues.apache.org/jira/browse/SPARK-44692
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44132) nesting full outer joins confuses code generator

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44132.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 41712
[https://github.com/apache/spark/pull/41712]

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> We are seeing issues with the code generator when querying Java-bean-encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, Seq("id"), "full_outer").join(dsC, Seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator. Depending on the data used, 
> it can generate stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code, we see that the code generator seems to be 
> mixing up parameters. For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  // <- null check on the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // <- causes an NPE 
> on the right parameter here{code}
> It is as if the nesting of the 2 full outer joins confuses the code generator 
> and, as such, generates invalid code.
> There is one other strange thing. We found this issue when using datasets that 
> use the Java bean encoder. We tried to reproduce this in the Spark shell or 
> using Scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stack traces 
> above) on the Spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] for this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44132) nesting full outer joins confuses code generator

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44132:


Assignee: Steven Aerts

> nesting full outer joins confuses code generator
> 
>
> Key: SPARK-44132
> URL: https://issues.apache.org/jira/browse/SPARK-44132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
> Environment: We verified the existence of this bug from spark 3.3 
> until spark 3.5.
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
>
> We are seeing issues with the code generator when querying Java-bean-encoded 
> data with 2 nested joins.
> {code:java}
> dsA.join(dsB, Seq("id"), "full_outer").join(dsC, Seq("id"), "full_outer"); 
> {code}
> will generate invalid code in the code generator. Depending on the data used, 
> it can generate stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code, we see that the code generator seems to be 
> mixing up parameters. For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {  // <- null check on the wrong (left) parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // <- causes an NPE 
> on the right parameter here{code}
> It is as if the nesting of the 2 full outer joins confuses the code generator 
> and, as such, generates invalid code.
> There is one other strange thing. We found this issue when using datasets that 
> use the Java bean encoder. We tried to reproduce this in the Spark shell or 
> using Scala case classes but were unable to do so. 
> We made a reproduction scenario as unit tests (one for each of the stack traces 
> above) on the Spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] for this case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44709) Fix flow control in ExecuteGrpcResponseSender

2023-08-07 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-44709:
-

 Summary: Fix flow control in ExecuteGrpcResponseSender
 Key: SPARK-44709
 URL: https://issues.apache.org/jira/browse/SPARK-44709
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0, 4.0.0
Reporter: Juliusz Sompolski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44708) Migrate test_reset_index assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44708:
--

 Summary: Migrate test_reset_index assert_eq to use 
assertDataFrameEqual
 Key: SPARK-44708
 URL: https://issues.apache.org/jira/browse/SPARK-44708
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44575) Implement Error Translation

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44575:


Assignee: Yihong He

> Implement Error Translation
> ---
>
> Key: SPARK-44575
> URL: https://issues.apache.org/jira/browse/SPARK-44575
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44575) Implement Error Translation

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44575.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42266
[https://github.com/apache/spark/pull/42266]

> Implement Error Translation
> ---
>
> Key: SPARK-44575
> URL: https://issues.apache.org/jira/browse/SPARK-44575
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Assignee: Yihong He
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44508) Add user guide for Python UDTFs

2023-08-07 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-44508:
-
Priority: Major  (was: Blocker)

> Add user guide for Python UDTFs
> ---
>
> Key: SPARK-44508
> URL: https://issues.apache.org/jira/browse/SPARK-44508
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Add documentation for Python UDTFs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44508) Add user guide for Python UDTFs

2023-08-07 Thread Allison Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751807#comment-17751807
 ] 

Allison Wang commented on SPARK-44508:
--

[~holden] Yup you're right. This isn't a blocker since we already have the 
documentation for the `udtf` function. I've updated the priority.
 

> Add user guide for Python UDTFs
> ---
>
> Key: SPARK-44508
> URL: https://issues.apache.org/jira/browse/SPARK-44508
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> Add documentation for Python UDTFs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43606) Enable IndexesTests.test_index_basic for pandas 2.0.0.

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43606:


Assignee: Haejoon Lee

> Enable IndexesTests.test_index_basic for pandas 2.0.0.
> --
>
> Key: SPARK-43606
> URL: https://issues.apache.org/jira/browse/SPARK-43606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Enable IndexesTests.test_index_basic for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43606) Enable IndexesTests.test_index_basic for pandas 2.0.0.

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43606.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42267
[https://github.com/apache/spark/pull/42267]

> Enable IndexesTests.test_index_basic for pandas 2.0.0.
> --
>
> Key: SPARK-43606
> URL: https://issues.apache.org/jira/browse/SPARK-43606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 4.0.0
>
>
> Enable IndexesTests.test_index_basic for pandas 2.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44707) Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped

2023-08-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44707.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42381
[https://github.com/apache/spark/pull/42381]

> Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped
> --
>
> Key: SPARK-44707
> URL: https://issues.apache.org/jira/browse/SPARK-44707
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44707) Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped

2023-08-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44707:
-

Assignee: Dongjoon Hyun

> Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped
> --
>
> Key: SPARK-44707
> URL: https://issues.apache.org/jira/browse/SPARK-44707
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44050) Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue

2023-08-07 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751796#comment-17751796
 ] 

Holden Karau commented on SPARK-44050:
--

Ah interesting, it sounds like the fix would be to retry config map creation on 
failure. 
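
A rough sketch of what such a retry could look like (generic Scala; `createConfigMap` is a hypothetical stand-in for the real Kubernetes client call, and the attempt count and delays are made-up numbers):
{code:scala}
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Generic retry-with-backoff sketch; illustration only, not Spark code.
object ConfigMapRetrySketch {
  def createConfigMap(name: String): Unit = {
    // placeholder for the actual Kubernetes client call that intermittently fails
    println(s"creating ConfigMap $name")
  }

  @tailrec
  def withRetry[T](attempts: Int, delayMs: Long)(op: => T): T =
    Try(op) match {
      case Success(value) => value
      case Failure(_) if attempts > 1 =>
        Thread.sleep(delayMs)
        withRetry(attempts - 1, delayMs * 2)(op)   // exponential backoff
      case Failure(error) => throw error
    }

  def main(args: Array[String]): Unit =
    withRetry(attempts = 3, delayMs = 500)(createConfigMap("spark-drv-conf-map"))
}
{code}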

> Unable to Mount ConfigMap in Driver Pod - ConfigMap Creation Issue
> --
>
> Key: SPARK-44050
> URL: https://issues.apache.org/jira/browse/SPARK-44050
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.3.1
>Reporter: Harshwardhan Singh Dodiya
>Priority: Critical
> Attachments: image-2023-06-14-11-07-36-960.png
>
>
> Dear Spark community,
> I am facing an issue related to mounting a ConfigMap in the driver pod of my 
> Spark application. Upon investigation, I realized that the problem is caused 
> by the ConfigMap not being created successfully.
> *Problem Description:*
> When attempting to mount the ConfigMap in the driver pod, I encounter 
> consistent failures and my pod stays in the ContainerCreating state. Upon further 
> investigation, I discovered that the ConfigMap does not exist in the 
> Kubernetes cluster, which results in the driver pod's inability to access the 
> required configuration data.
> *Additional Information:*
> I would like to highlight that this issue is not a frequent occurrence. It 
> has been observed randomly, affecting the mounting of the ConfigMap in the 
> driver pod only approximately 5% of the time. This intermittent behavior adds 
> complexity to the troubleshooting process, as it is challenging to reproduce 
> consistently.
> *Error Message:*
> When describing the driver pod (kubectl describe pod pod_name), I get the below 
> error.
> "ConfigMap '' not found."
> *To Reproduce:*
> 1. Download spark 3.3.1 from [https://spark.apache.org/downloads.html]
> 2. create an image with "bin/docker-image-tool.sh"
> 3. Submit on the Spark client via a bash command, passing all the details and 
> configurations.
> 4. Randomly, in some of the driver pods, we can observe this issue.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44508) Add user guide for Python UDTFs

2023-08-07 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751795#comment-17751795
 ] 

Holden Karau commented on SPARK-44508:
--

I'm not sure this should be a blocker.

> Add user guide for Python UDTFs
> ---
>
> Key: SPARK-44508
> URL: https://issues.apache.org/jira/browse/SPARK-44508
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Blocker
>
> Add documentation for Python UDTFs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44597:
---
Description: Migrate tests to new test utils in this file: 
python/pyspark/pandas/tests/test_sql.py  (was: The Jira ticket [SPARK-44042] 
SPIP: PySpark Test Framework  introduces a new PySpark test framework. Some of 
the user-facing testing util APIs include assertDataFrameEqual, 
assertSchemaEqual, and assertPandasOnSparkEqual.

With the new testing framework, we should migrate old tests in the Spark 
codebase to use the new testing utils.)

> Migrate test_sql assert_eq to use assertDataFrameEqual
> --
>
> Key: SPARK-44597
> URL: https://issues.apache.org/jira/browse/SPARK-44597
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Migrate tests to new test utils in this file: 
> python/pyspark/pandas/tests/test_sql.py



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44589) Migrate PySpark tests to use PySpark built-in test utils

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44589:
---
Description: 
The Jira ticket SPARK-44042 SPIP: PySpark Test Framework introduces a new 
PySpark test framework. Some of the user-facing testing util APIs include 
assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.

With the new testing framework, we should migrate old tests in the Spark 
codebase to use the new testing utils.

  was:Migrate existing tests in the PySpark codebase to use the new PySpark 
test utils, outlined here: https://issues.apache.org/jira/browse/SPARK-44042


> Migrate PySpark tests to use PySpark built-in test utils
> 
>
> Key: SPARK-44589
> URL: https://issues.apache.org/jira/browse/SPARK-44589
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> The Jira ticket SPARK-44042 SPIP: PySpark Test Framework introduces a new 
> PySpark test framework. Some of the user-facing testing util APIs include 
> assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.
> With the new testing framework, we should migrate old tests in the Spark 
> codebase to use the new testing utils.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44561) Fix AssertionError when converting UDTF output to a complex type

2023-08-07 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-44561.
---
Fix Version/s: 3.5.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 42310
https://github.com/apache/spark/pull/42310

> Fix AssertionError when converting UDTF output to a complex type
> 
>
> Key: SPARK-44561
> URL: https://issues.apache.org/jira/browse/SPARK-44561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>
> {code:java}
> class TestUDTF:
>     def eval(self):
>         yield {'a': 1, 'b': 2},
> udtf(TestUDTF, returnType="x: map")().show() {code}
> This will fail with:
>   File "pandas/_libs/lib.pyx", line 2834, in pandas._libs.lib.map_infer
>   File "python/pyspark/sql/pandas/types.py", line 804, in convert_map
>     assert isinstance(value, dict)
> AssertionError
> Same for `convert_struct`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44707) Use INFO log in ExecutorPodsWatcher.onClose if SparkContext is stopped

2023-08-07 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-44707:
-

 Summary: Use INFO log in ExecutorPodsWatcher.onClose if 
SparkContext is stopped
 Key: SPARK-44707
 URL: https://issues.apache.org/jira/browse/SPARK-44707
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24282) Add support for PMML export for the Standard Scaler Stage

2023-08-07 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751758#comment-17751758
 ] 

Holden Karau commented on SPARK-24282:
--

I don't think we're going to do this anymore.

> Add support for PMML export for the Standard Scaler Stage
> -
>
> Key: SPARK-24282
> URL: https://issues.apache.org/jira/browse/SPARK-24282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Trivial
>
> In anticipation of supporting exporting multiple stages together to PMML, it 
> would be good to support one of the common feature preparation stages, 
> StandardScaler.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28740) Add support for building with bloop

2023-08-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-28740.
--
Resolution: Won't Fix

> Add support for building with bloop
> ---
>
> Key: SPARK-28740
> URL: https://issues.apache.org/jira/browse/SPARK-28740
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Minor
>
> Bloop can, in theory, build Scala faster. However, the JAR layout is a little 
> different when you try to run the tests. It would be useful if we updated 
> our test JAR discovery to work with bloop.
> Before working on this, check to make sure that bloop itself has changed to 
> work with Spark. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32111) Cleanup locks and docs in CoarseGrainedSchedulerBackend

2023-08-07 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751754#comment-17751754
 ] 

Holden Karau commented on SPARK-32111:
--

I think this could be a good target for Spark 4.0

> Cleanup locks and docs in CoarseGrainedSchedulerBackend
> ---
>
> Key: SPARK-32111
> URL: https://issues.apache.org/jira/browse/SPARK-32111
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 4.0.0
>Reporter: Holden Karau
>Priority: Trivial
>
> Our locking is inconsistent inside of CoarseGrainedSchedulerBackend. In 
> production this doesn't matter because the methods are called from inside 
> of ThreadSafeRPC, but it would be better to clean up the locks and docs to be 
> consistent, especially since other schedulers inherit from 
> CoarseGrainedSchedulerBackend.
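
As an illustration of the kind of consistency being asked for, a generic sketch (not the actual CoarseGrainedSchedulerBackend code; the class and field names are invented) that documents which lock guards which field and takes that lock in every accessor:
{code:scala}
import javax.annotation.concurrent.GuardedBy   // javax.annotation.concurrent (jsr305)

import scala.collection.mutable

// Generic sketch of consistent lock usage plus documentation; illustration only.
class ExecutorBookkeeping {
  private val lock = new Object

  // The annotation documents which lock protects the field, and every accessor
  // actually takes that lock, so readers do not have to guess.
  @GuardedBy("lock")
  private val executorsPendingToRemove = mutable.HashSet[String]()

  def markPendingRemoval(execId: String): Unit = lock.synchronized {
    executorsPendingToRemove += execId
  }

  def isPendingRemoval(execId: String): Boolean = lock.synchronized {
    executorsPendingToRemove.contains(execId)
  }
}
{code}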



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32111) Cleanup locks and docs in CoarseGrainedSchedulerBackend

2023-08-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-32111:
-
 Target Version/s: 4.0.0
Affects Version/s: 4.0.0

> Cleanup locks and docs in CoarseGrainedSchedulerBackend
> ---
>
> Key: SPARK-32111
> URL: https://issues.apache.org/jira/browse/SPARK-32111
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 4.0.0
>Reporter: Holden Karau
>Priority: Trivial
>
> Our locking is inconsistent inside of CoarseGrainedSchedulerBackend. In 
> production this doesn't matter because the methods are called from inside 
> of ThreadSafeRPC, but it would be better to clean up the locks and docs to be 
> consistent, especially since other schedulers inherit from 
> CoarseGrainedSchedulerBackend.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38475) Use error classes in org.apache.spark.serializer

2023-08-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38475.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42243
[https://github.com/apache/spark/pull/42243]

> Use error classes in org.apache.spark.serializer
> 
>
> Key: SPARK-38475
> URL: https://issues.apache.org/jira/browse/SPARK-38475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38475) Use error classes in org.apache.spark.serializer

2023-08-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38475:


Assignee: Bo Zhang

> Use error classes in org.apache.spark.serializer
> 
>
> Key: SPARK-38475
> URL: https://issues.apache.org/jira/browse/SPARK-38475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)

2023-08-07 Thread Juan Carlos Blanco Martinez (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751728#comment-17751728
 ] 

Juan Carlos Blanco Martinez commented on SPARK-44706:
-

Using [Tuple1|https://www.scala-lang.org/api/2.13.4/scala/Tuple1.html] instead 
of DataRow serves as a workaround.
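
A minimal sketch of that workaround, mirroring the notebook-style snippet from the description (the setup lines are repeated only so the block is self-contained, and the constant return value 3 just copies the original example):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val sparkSession = SparkSession.builder.getOrCreate()
import sparkSession.implicits._

// Workaround sketch: build the inner DataFrame from Tuple1 rows instead of the
// notebook-defined case class, so no ProductEncoder for DataRow has to be
// resolved inside the UDF body.
val udf1 = udf((x: String) => {
  val testData = Seq(Tuple1("test1"), Tuple1("test2")).toDF("test")
  3
})
{code}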

> ProductEncoder not working as expected within User Defined Function (UDF)
> 
>
> Key: SPARK-44706
> URL: https://issues.apache.org/jira/browse/SPARK-44706
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: DBR (Databricks) 13.2, Spark 3.4.0 and Scala 2.12.
>Reporter: Juan Carlos Blanco Martinez
>Priority: Major
>
> Hi,
> When running the following code in Databricks' notebook:
> {code:java}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
>   
>   case class DataRow(field1: String)
>   val sparkSession = SparkSession.builder.getOrCreate()
>   import sparkSession.implicits._
>   val udf1 = udf((x: String) => {
>     val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // 
> This is failing at runtime
>     3
>   })
>   val df1 = Seq(DataRow("test1"), 
> DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
> working
>   display(df1)
> {code}
> I get a ScalaReflectionException at runtime:
> {code:java}
> ScalaReflectionException: class 
> $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read in JavaMirror with 
> com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
>  of type class 
> com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
> classpath [] and parent being 
> com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897
>  of type class 
> com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader 
> with classpath [] and parent being 
> com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a 
> of type class 
> com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with 
> classpath 
> [file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/]
>  and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class 
> sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) 
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) 
> at 
> $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11)
>  
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) 
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) 
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848)
>  
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55)
>  
> at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) 
> at 
> org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302)
>  
> at 
> org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302)
>  
> at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) 
> at 
> $linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11)
>  {code}
> Declaring the case class within the UDF:
> {code:java}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
>  
>   case class DataRow(field1: String)
>   val sparkSession = SparkSession.builder.getOrCreate()
>   import sparkSession.implicits._  
>   val udf1 = udf((x: String) => {
>     case class DataRow(field1: String)
>     val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // 
> This is failing at runtime
>     3
>   })  
>   val df1 = Seq(DataRow("test1"), 
> DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
> working
>   display(df1) {code}
> I get a different error
> {code:java}
> error: value toDF is not a member of Seq[DataRow] val testData = 
> Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code}
> Questions:
> * Is this an expected behaviour or a bug?
> Thanks.



--
This message was sent by Atlassian Jira

[jira] [Resolved] (SPARK-44068) Support positional parameters in Scala connect client

2023-08-07 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44068.
--
Resolution: Duplicate

> Support positional parameters in Scala connect client
> -
>
> Key: SPARK-44068
> URL: https://issues.apache.org/jira/browse/SPARK-44068
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Major
>
> Implement positional parameters of parametrized queries in the Scala connect 
> client.
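For illustration only, a hedged sketch of what positional parameters could look like from the Scala connect client once implemented; the {{remote()}} builder call and the {{sql(sqlText, args)}} overload are assumptions mirroring the existing server-side API for parameterized queries, not a confirmed client API:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()

// The '?' markers are bound to the array elements in order.
val df = spark.sql("SELECT * FROM range(10) WHERE id > ? AND id < ?", Array(2, 8))
df.show()
{code}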



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)

2023-08-07 Thread Juan Carlos Blanco Martinez (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Carlos Blanco Martinez updated SPARK-44706:

Description: 
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897
 of type class 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a 
of type class 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with 
classpath 
[file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/]
 and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11)
 
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) 
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) 
at 
org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848)
 
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55)
 
at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302)
 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302)
 
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11)
 {code}
Declaring the case class within the UDF:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
 
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._  
  val udf1 = udf((x: String) => {
    case class DataRow(field1: String)
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })  
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1) {code}
I get a different error
{code:java}
error: value toDF is not a member of Seq[DataRow] val testData = 
Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code}

Questions:
* Is this an expected behaviour or a bug?

Thanks.

  was:
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 

[jira] [Updated] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)

2023-08-07 Thread Juan Carlos Blanco Martinez (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Carlos Blanco Martinez updated SPARK-44706:

Description: 
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897
 of type class 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a 
of type class 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with 
classpath 
[file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/]
 and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11)
 
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) 
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) 
at 
org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848)
 
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55)
 
at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302)
 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302)
 
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11)
 {code}
Declaring the case class within the UDF:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
 
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._  
  val udf1 = udf((x: String) => {
    case class DataRow(field1: String)
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })  
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1) {code}
I get a different error
{code:java}
error: value toDF is not a member of Seq[DataRow] val testData = 
Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code}
Thanks.

  was:
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and 

[jira] [Updated] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)

2023-08-07 Thread Juan Carlos Blanco Martinez (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Carlos Blanco Martinez updated SPARK-44706:

Description: 
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897
 of type class 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a 
of type class 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with 
classpath 
[file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/]
 and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11)
 
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) 
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) 
at 
org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848)
 
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55)
 
at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302)
 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302)
 
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11)
 {code}
Declaring the case class within the UDF:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._  
  val udf1 = udf((x: String) => {
    case class DataRow(field1: String)
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })  
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1) {code}
I get a different error
{code:java}
error: value toDF is not a member of Seq[DataRow] val testData = 
Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code}
Thanks.

  was:
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and parent being 

[jira] [Updated] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)

2023-08-07 Thread Juan Carlos Blanco Martinez (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Carlos Blanco Martinez updated SPARK-44706:

Description: 
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897
 of type class 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a 
of type class 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with 
classpath 
[file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/]
 and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11)
 
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) 
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) 
at 
org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848)
 
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55)
 
at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302)
 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302)
 
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11)
 {code}
Declaring the case class within the UDF:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}    val 
sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._  val udf1 = udf((x: String) => {
    case class DataRow(field1: String)
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1) {code}
I get a different error
{code:java}
error: value toDF is not a member of Seq[DataRow] val testData = 
Seq(DataRow("test1"), DataRow("test2")).toDF("test"){code}
Thanks.

  was:
Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and parent being 

[jira] [Created] (SPARK-44706) ProductEncoder not working as expected within User Defined Function (UDF)

2023-08-07 Thread Juan Carlos Blanco Martinez (Jira)
Juan Carlos Blanco Martinez created SPARK-44706:
---

 Summary: ProductEncoder not working as expected within User 
Defined Function (UDF)
 Key: SPARK-44706
 URL: https://issues.apache.org/jira/browse/SPARK-44706
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
 Environment: DBR (Databricks) 13.2, Spark 3.4.0 and Scala 2.12.
Reporter: Juan Carlos Blanco Martinez


Hi,

When running the following code in Databricks' notebook:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a ScalaReflectionException at runtime:
{code:java}
ScalaReflectionException: class $linedb3da7b2933d4b63a62b3d2a21c675f2141.$read 
in JavaMirror with 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader@717115ad
 of type class 
com.databricks.backend.daemon.driver.DriverLocal$DriverLocalClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader@1670897
 of type class 
com.databricks.backend.daemon.driver.ClassLoaders$ReplWrappingClassLoader with 
classpath [] and parent being 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1a42da0a 
of type class 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader with 
classpath 
[file:/local_disk0/tmp/repl/spark-4071811259476162981-8e526ae9-25fb-4545-8d3f-963a8661cd2b/]
 and parent being sun.misc.Launcher$AppClassLoader@43ee72e6 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] not found. 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:141) 
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:29) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator6$1.apply(command-2013905963200886:11)
 
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:237) 
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:237) 
at 
org.apache.spark.sql.catalyst.ScalaReflection$.encoderFor(ScalaReflection.scala:848)
 
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55)
 
at org.apache.spark.sql.Encoders$.product(Encoders.scala:312) 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:302)
 
at 
org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:302)
 
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) 
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1(command-2013905963200886:11)
 {code}
Declaring the case class within the UDF:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
  case class DataRow(field1: String)
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._
  val udf1 = udf((x: String) => {
    val testData = Seq(DataRow("test1"), DataRow("test2")).toDF("test") // This 
is failing at runtime
    3
  })
  val df1 = Seq(DataRow("test1"), 
DataRow("test2")).toDF("test").withColumn("udf", udf1($"test")) // This is 
working
  display(df1)
{code}
I get a different exception at runtime:


{code:java}
at 
$linedb3da7b2933d4b63a62b3d2a21c675f2196.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iwc35395a9233a7197629c985e87dce75w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$udf1$1$adapted(command-2013905963200886:10)
 
at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:223)
 
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1205) 
... 111 more{code}

Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44705) Make PythonRunner single-threaded

2023-08-07 Thread Utkarsh Agarwal (Jira)
Utkarsh Agarwal created SPARK-44705:
---

 Summary: Make PythonRunner single-threaded
 Key: SPARK-44705
 URL: https://issues.apache.org/jira/browse/SPARK-44705
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Utkarsh Agarwal



PythonRunner, a utility that executes Python UDFs in Spark, uses two threads in 
a producer-consumer model today. This multi-threading model is problematic and 
confusing as Spark's execution model within a task is commonly understood to be 
single-threaded. 
More importantly, this departure into double-threaded execution resulted in a 
series of customer issues involving [race 
conditions|https://issues.apache.org/jira/browse/SPARK-33277] and 
[deadlocks|https://issues.apache.org/jira/browse/SPARK-38677] between threads, 
as the code was hard to reason about. There have been multiple attempts to 
rein in these issues, viz., [fix 
1|https://issues.apache.org/jira/browse/SPARK-22535], [fix 
2|https://github.com/apache/spark/pull/30177], [fix 
3|https://github.com/apache/spark/commit/243c321db2f02f6b4d926114bd37a6e74c2be185].
 Moreover, the fixes have made the code base somewhat abstruse by introducing 
multiple daemon [monitor 
threads|https://github.com/apache/spark/blob/a3a32912be04d3760cb34eb4b79d6d481bbec502/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L579]
 to detect deadlocks.
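
For readers unfamiliar with the pattern, the sketch below is a simplified, illustrative rendering of the two-thread producer/consumer shape described above; it is not the actual PythonRunner code. The hand-off between the daemon writer thread and the consuming task thread is where the races and deadlocks referenced here historically crept in, and a single-threaded runner removes that hand-off.

{code:java}
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

// Producer: a daemon thread feeds input rows into the worker's stdin.
// Consumer: the task thread reads result rows back from the worker's stdout.
def runWorker(worker: Process, input: Iterator[String]): Iterator[String] = {
  val writerThread = new Thread("worker-input-writer") {
    override def run(): Unit = {
      val out = new PrintWriter(worker.getOutputStream)
      try input.foreach(line => out.println(line)) finally out.close()
    }
  }
  writerThread.setDaemon(true)
  writerThread.start()

  val reader = new BufferedReader(new InputStreamReader(worker.getInputStream))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}
{code}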



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44634) Encoders.bean does no longer support nested beans with type arguments

2023-08-07 Thread Giambattista Bloisi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giambattista Bloisi resolved SPARK-44634.
-
Fix Version/s: 3.4.2
   3.5.0
   4.0.0
   Resolution: Fixed

PR has been merged

> Encoders.bean does no longer support nested beans with type arguments
> -
>
> Key: SPARK-44634
> URL: https://issues.apache.org/jira/browse/SPARK-44634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Giambattista Bloisi
>Priority: Major
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>
> Hi,
>   while upgrading a project from spark 2.4.0 to 3.4.1 version, I have 
> encountered the same problem described in [java - Encoders.bean attempts to 
> check the validity of a return type considering its generic type and not its 
> concrete class, with Spark 3.4.0 - Stack 
> Overflow|https://stackoverflow.com/questions/76045255/encoders-bean-attempts-to-check-the-validity-of-a-return-type-considering-its-ge].
> Put it short, starting from Spark 3.4.x Encoders.bean throws an exception 
> when the passed class contains a field whose type is a nested bean with type 
> arguments:
>  
> {code:java}
> class A<T> {
>T value;
>// value getter and setter
> }
> class B {
>A<String> stringHolder;
>// stringHolder getter and setter
> }
> Encoders.bean(B.class); // throws "SparkUnsupportedOperationException: 
> [ENCODER_NOT_FOUND]..."{code}
>  
>  
> It looks like this is a regression introduced with [SPARK-42093 SQL Move 
> JavaTypeInference to 
> AgnosticEncoders|https://github.com/apache/spark/commit/18672003513d5a4aa610b6b94dbbc15c33185d3#diff-1191737b908340a2f4c22b71b1c40ebaa0da9d8b40c958089c346a3bda26943b]
>  while getting rid of TypeToken, which somehow handled that case.
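
A self-contained rendering of the reported pattern, for anyone who wants to reproduce it quickly; the original report uses plain Java beans, and {{@BeanProperty}} here is just a compact Scala way to generate the same getters and setters:

{code:java}
import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

class A[T] { @BeanProperty var value: T = _ }
class B    { @BeanProperty var stringHolder: A[String] = _ }

// Per the report, this worked before the 3.4 line and throws
// ENCODER_NOT_FOUND on 3.4.x until the fix lands.
val encoder = Encoders.bean(classOf[B])
{code}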



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44697) Clean up the usage of org.apache.commons.lang3.RandomUtils

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44697:


Assignee: Yang Jie

> Clean up the usage of org.apache.commons.lang3.RandomUtils
> --
>
> Key: SPARK-44697
> URL: https://issues.apache.org/jira/browse/SPARK-44697
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> {code:java}
> /**
>  * Utility library that supplements the standard {@link Random} class.
>  *
>  * Caveat: Instances of {@link Random} are not cryptographically 
> secure.
>  *
>  * Please note that the Apache Commons project provides a component
>  * dedicated to pseudo-random number generation, namely
>  * <a href="https://commons.apache.org/proper/commons-rng/">Commons RNG</a>, 
> that may be
>  * a better choice for applications with more stringent requirements
>  * (performance and/or correctness).
>  *
>  * @deprecated Use Apache Commons RNG's optimized <a href="https://commons.apache.org/proper/commons-rng/commons-rng-client-api/apidocs/org/apache/commons/rng/UniformRandomProvider.html">UniformRandomProvider</a>
>  * @since 3.3
>  */
> @Deprecated
> public class RandomUtils { {code}
> In {{commons-lang3}} 3.13.0, {{RandomUtils}} has been marked as 
> {{@Deprecated}}
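
One possible direction for the cleanup, sketched for illustration (the replacement actually chosen in the PR may differ): swap the deprecated commons-lang3 calls for JDK randomness that is already on the classpath.

{code:java}
import java.util.concurrent.ThreadLocalRandom

val rnd = ThreadLocalRandom.current()
val randomInt = rnd.nextInt(0, 100)          // was RandomUtils.nextInt(0, 100)
val randomBytes = new Array[Byte](16)
rnd.nextBytes(randomBytes)                   // was RandomUtils.nextBytes(16)
{code}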



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44697) Clean up the usage of org.apache.commons.lang3.RandomUtils

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44697.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42370
[https://github.com/apache/spark/pull/42370]

> Clean up the usage of org.apache.commons.lang3.RandomUtils
> --
>
> Key: SPARK-44697
> URL: https://issues.apache.org/jira/browse/SPARK-44697
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 4.0.0
>
>
> {code:java}
> /**
>  * Utility library that supplements the standard {@link Random} class.
>  *
>  * Caveat: Instances of {@link Random} are not cryptographically 
> secure.
>  *
>  * Please note that the Apache Commons project provides a component
>  * dedicated to pseudo-random number generation, namely
>  * <a href="https://commons.apache.org/proper/commons-rng/">Commons RNG</a>, 
> that may be
>  * a better choice for applications with more stringent requirements
>  * (performance and/or correctness).
>  *
>  * @deprecated Use Apache Commons RNG's optimized <a href="https://commons.apache.org/proper/commons-rng/commons-rng-client-api/apidocs/org/apache/commons/rng/UniformRandomProvider.html">UniformRandomProvider</a>
>  * @since 3.3
>  */
> @Deprecated
> public class RandomUtils { {code}
> In {{commons-lang3}} 3.13.0, {{RandomUtils}} has been marked as 
> {{@Deprecated}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44686) Add option to create RowEncoder in Encoders helper class.

2023-08-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-44686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751671#comment-17751671
 ] 

Herman van Hövell commented on SPARK-44686:
---

 This has been merged to master/3.5. Preparing backports for 3.4/3.3.
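
For context, a small sketch of how the new helper is expected to be used, assuming it is exposed as {{Encoders.row(schema)}} and behaves like the existing {{RowEncoder}} for the given schema:

{code:java}
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

// An Encoder[Row] for this schema, usable in Dataset operations such as map/flatMap.
val rowEncoder = Encoders.row(schema)
{code}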

> Add option to create RowEncoder in Encoders helper class.
> -
>
> Key: SPARK-44686
> URL: https://issues.apache.org/jira/browse/SPARK-44686
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.5.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44704) Cleanup shuffle files from host node after migration due to graceful decommissioning

2023-08-07 Thread Deependra Patel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751654#comment-17751654
 ] 

Deependra Patel commented on SPARK-44704:
-

I will create a PR for this soon.

> Cleanup shuffle files from host node after migration due to graceful 
> decommissioning
> 
>
> Key: SPARK-44704
> URL: https://issues.apache.org/jira/browse/SPARK-44704
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.4.1
>Reporter: Deependra Patel
>Priority: Minor
>
> Although these files will be deleted at the end of the application by the 
> external shuffle service, doing this early can free up resources and can help 
> prevent long-running applications from running out of disk space.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44704) Cleanup shuffle files from host node after migration due to graceful decommissioning

2023-08-07 Thread Deependra Patel (Jira)
Deependra Patel created SPARK-44704:
---

 Summary: Cleanup shuffle files from host node after migration due 
to graceful decommissioning
 Key: SPARK-44704
 URL: https://issues.apache.org/jira/browse/SPARK-44704
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.4.1
Reporter: Deependra Patel


Although these files will be deleted at the end of the application by the 
external shuffle service, doing this early can free up resources and can help 
prevent long-running applications from running out of disk space.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44694) Add Default & Active SparkSession for Python Client

2023-08-07 Thread Nikita Awasthi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751634#comment-17751634
 ] 

Nikita Awasthi commented on SPARK-44694:


User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/42371

> Add Default & Active SparkSession for Python Client
> ---
>
> Key: SPARK-44694
> URL: https://issues.apache.org/jira/browse/SPARK-44694
> Project: Spark
>  Issue Type: Task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-43429 for Python side



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44701) Skip ClassificationTestsOnConnect when torch is not installed

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44701:


Assignee: Ruifeng Zheng

> Skip ClassificationTestsOnConnect when torch is not installed
> -
>
> Key: SPARK-44701
> URL: https://issues.apache.org/jira/browse/SPARK-44701
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44701) Skip ClassificationTestsOnConnect when torch is not installed

2023-08-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44701.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42375
[https://github.com/apache/spark/pull/42375]

> Skip ClassificationTestsOnConnect when torch is not installed
> -
>
> Key: SPARK-44701
> URL: https://issues.apache.org/jira/browse/SPARK-44701
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Minor
> Fix For: 3.5.0, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20080) Spak streaming application do not throw serialisation exception in foreachRDD

2023-08-07 Thread Nick Hryhoriev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-20080:
---
Description: 
Steps to reproduce:
1) Use a Spark Streaming application.
2) Inside foreachRDD of any DStream, use rdd.foreachPartition.
3) Use org.slf4j.Logger inside the foreachPartition action: initialize it there, or use it as a field captured by the closure.

Result:
No exception is thrown and the foreachPartition action is not executed.
Expected result:
java.io.NotSerializableException is thrown.

Description:
When I try to use or initialize org.slf4j.Logger inside foreachPartition, extracted to a trait method that is called from foreachRDD, I have found that the foreachPartition method does not execute and no exception appears.
Tested on Spark in local and YARN mode.

The code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main classes that explain the problem.

If I run the same code as a batch job, I get this exception ->
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at TraitWithMethod$class.executeForEachpartitoin(TraitWithMethod.scala:12)
at ReproduceBugMain$.executeForEachpartitoin(ReproduceBugMain.scala:7)
at ReproduceBugMain$.main(ReproduceBugMain.scala:14)
at ReproduceBugMain.main(ReproduceBugMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.io.NotSerializableException: ReproduceBugMain$
Serialization stack:
- object not serializable (class: ReproduceBugMain$, value: 
ReproduceBugMain$@3935e9a8)
- field (class: TraitWithMethod$$anonfun$executeForEachpartitoin$1, name: 
$outer, type: interface TraitWithMethod)
- object (class TraitWithMethod$$anonfun$executeForEachpartitoin$1, 
)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 18 more
{code}
On GitHub there are two commits: the initial one, linked above, contains the streaming example, and the last one contains the batch job example.
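
For convenience, a condensed, illustrative sketch of the reported setup (the complete reproduction lives in the linked repository; class and method names here are simplified): a non-serializable logger field on a trait is captured by the closure passed to rdd.foreachPartition inside foreachRDD.

{code:java}
import org.apache.spark.streaming.dstream.DStream
import org.slf4j.{Logger, LoggerFactory}

trait TraitWithLogger {
  // org.slf4j.Logger is not Serializable, so the closure below cannot be serialized.
  val logger: Logger = LoggerFactory.getLogger(getClass)

  def executeForEachPartition(stream: DStream[String]): Unit = {
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach(r => logger.info(r))   // referencing `logger` drags in `this`
      }
    }
  }
}
{code}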

  was:
Step to reproduce:
1)Use SPak Streaming application.
2) inside foreachRDD of any DStream, use rdd,foreachPartition.
3) use org.slf4j.Logger. Init it or use as a filed with closure. inside 
foreachPartition action.

Result:
No exception throw. foreachPartition action not executed.
Expected result:
Throw java.io.NotSerializableException

description:
When i try use or init org.slf4j.Logger inside foreachPartition, that extracted 
to trait method. 
What was called in foreachRDD. 
I have found that foreachPartition method do not execute and no exception 
appeared.
Tested on local and yarn mode spark.

code can be found on 
[github|https://github.com/GrigorievNick/Spark2_1TraitLoggerSerialisationBug/tree/9da55393850df9fe19f5ff3e63b47ec2d1f67e17].
 There are two main class that explain problem.

if i will run same code with batch job. I will get exception ->
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at 

[jira] [Resolved] (SPARK-31427) Spark Structure streaming read data twice per every micro-batch.

2023-08-07 Thread Nick Hryhoriev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev resolved SPARK-31427.

Resolution: Invalid

> Spark Structure streaming read data twice per every micro-batch.
> 
>
> Key: SPARK-31427
> URL: https://issues.apache.org/jira/browse/SPARK-31427
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.7, 3.0.1, 3.0.2
>Reporter: Nick Hryhoriev
>Priority: Major
>
> I have a very strange issue with Spark Structured Streaming: it creates two 
> Spark jobs for every micro-batch and, as a result, reads the data from Kafka 
> twice. Here is a simple code snippet.
>  
> {code:java}
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
> object CheckHowSparkReadFromKafka {
>   def main(args: Array[String]): Unit = {
> val session = SparkSession.builder()
>   .config(new SparkConf()
> .setAppName(s"simple read from kafka with repartition")
> .setMaster("local[*]")
> .set("spark.driver.host", "localhost"))
>   .getOrCreate()
> val testPath = "/tmp/spark-test"
> FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new 
> Path(testPath), true)
> import session.implicits._
> val stream = session
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers","kafka-20002-prod:9092")
>   .option("subscribe", "topic")
>   .option("maxOffsetsPerTrigger", 1000)
>   .option("failOnDataLoss", false)
>   .option("startingOffsets", "latest")
>   .load()
>   .repartitionByRange( $"offset")
>   .writeStream
>   .option("path", testPath + "/data")
>   .option("checkpointLocation", testPath + "/checkpoint")
>   .format("parquet")
>   .trigger(Trigger.ProcessingTime(10.seconds))
>   .start()
> stream.processAllAvailable()
> {code}
> This happens because of {{.repartitionByRange( $"offset")}}; if I remove this 
> line, everything is fine. With it, Spark creates two jobs: one with a single 
> stage that only reads from Kafka, and a second with three stages (read -> 
> shuffle -> write), so the result of the first job is never used.
> This has a significant impact on performance. Some of my Kafka topics have 
> 1550 partitions, so reading them twice is a big deal. If I add a cache, things 
> get better, but that is not an option for me. In local mode, the first job of 
> each batch takes less than 0.1 ms, except the batch with index 0. But on a YARN 
> cluster and on Mesos, both jobs are fully executed and on my topics take nearly 
> 1.2 min.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34204) When use input_file_name() func all column from file appeared in physical plan of query, not only projection.

2023-08-07 Thread Nick Hryhoriev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev resolved SPARK-34204.

Fix Version/s: 3.1.1
   Resolution: Fixed

> When use input_file_name() func all column from file appeared in physical 
> plan of query, not only projection.
> -
>
> Key: SPARK-34204
> URL: https://issues.apache.org/jira/browse/SPARK-34204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: Nick Hryhoriev
>Priority: Major
> Fix For: 3.1.1
>
>
> The input_file_name() function breaks the projection applied to the physical plan of 
> the query.
>  If this function is used to add a new column, column-oriented formats like parquet 
> and orc put all columns into the physical plan.
>  Without it, only the selected columns are read.
>  In my case, the performance impact is x30.
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> object TestSize {
>   def main(args: Array[String]): Unit = {
> implicit val spark: SparkSession = SparkSession.builder()
>   .master("local")
>   .config("spark.sql.shuffle.partitions", "5")
>   .getOrCreate()
> import spark.implicits._
> val query1 = spark.read.parquet(
>   "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
> )
>   .select($"app_id", $"idfa", input_file_name().as("fileName"))
>   .distinct()
>   .count()
>val query2 = spark.read.parquet( 
> "s3a://part-00040-a19f0d20-eab3-48ef-be5a- 602c7f9a8e58.c000.gz.parquet" ) 
>   .select($"app_id", $"idfa")
>   .distinct() 
>   .count()
> Thread.sleep(100L)
>   }
> }
> {code}
> `query1` has all columns in the physical plan, while `query2` only two.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35094) Spark from_json(JsonToStruct) function return wrong value in permissive mode in case best effort

2023-08-07 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751630#comment-17751630
 ] 

Nick Hryhoriev commented on SPARK-35094:


I miss understand, do the issue resolved?

> Spark from_json(JsonToStruct)  function return wrong value in permissive mode 
> in case best effort
> -
>
> Key: SPARK-35094
> URL: https://issues.apache.org/jira/browse/SPARK-35094
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Nick Hryhoriev
>Priority: Major
>
> I use Spark 3.1.1 and 3.0.2.
>  The `from_json` function returns a wrong result in PERMISSIVE mode.
>  The corner case:
>  1. The JSON message has a complex nested structure:
>  \{sameNameField: damaged, nestedVal: {badSchemaNestedVal, 
> sameNameField_WhichValueWillAppearInwrongPlace}}
>  2. Nested -> nested field: the schema does not fully align with the value in the JSON.
> scala code to reproduce:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.from_json
> import org.apache.spark.sql.types.IntegerType
> import org.apache.spark.sql.types.StringType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> object Main {
>   def main(args: Array[String]): Unit = {
> implicit val spark: SparkSession = 
> SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
> val schemaForFieldWhichWillHaveWrongValue = 
> StructField("problematicName", StringType, nullable = true)
> val nestedFieldWhichNotSatisfyJsonMessage = StructField(
>   "badNestedField",
>   StructType(Seq(StructField("SomethingWhichNotInJsonMessage", 
> IntegerType, nullable = true)))
> )
> val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage =
>   StructField(
> "nestedField",
> StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, 
> schemaForFieldWhichWillHaveWrongValue))
>   )
> val customSchema = StructType(Seq(
>   schemaForFieldWhichWillHaveWrongValue,
>   nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
> ))
> val jsonStringToTest =
>   
> """{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""
> val df = List(jsonStringToTest)
>   .toDF("json")
>   // issue happen only in permissive mode during best effort
>   .select(from_json($"json", customSchema).as("toBeFlatten"))
>   .select("toBeFlatten.*")
> df.show(truncate = false)
> assert(
>   df.select("problematicName").as[String].first() == 
> "ThisValueWillBeOverwritten",
>   "wrong value in root schema, parser take value from column with same 
> name but in another nested elvel"
> )
>   }
> }
> {code}
> I was not able to debug this issue, to find the exact root cause.
>  But what I find in debug, that In 
> `org.apache.spark.sql.catalyst.util.FailureSafeParser` in line 64. code block 
> `e.partialResult()` already have a wrong value.
> I hope this will help to fix the issue.
> I do a DIRTY HACK to fix the issue.
>  I just fork this function and hardcode `None` -> `Iterator(toResultRow(None, 
> e.record))`.
>  In my case, it's better to do not have any values in the row, than 
> theoretically have a wrong value in some column.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44703) Log eventLog rewrite duration when compact old event log files

2023-08-07 Thread shuyouZZ (Jira)
shuyouZZ created SPARK-44703:


 Summary: Log eventLog rewrite duration when compact old event log 
files
 Key: SPARK-44703
 URL: https://issues.apache.org/jira/browse/SPARK-44703
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: shuyouZZ
 Fix For: 3.5.0


When {{spark.eventLog.rolling.enabled}} is enabled and the number of eventLog files 
exceeds the value of {{{}spark.history.fs.eventLog.rolling.maxFilesToRetain{}}},
the HistoryServer compacts the old event log files into one compact file.

Currently the rewrite duration is not logged in the {{rewrite}} method. This 
metric is useful for understanding the compaction duration, so we should add a log 
statement to the method.
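
A minimal sketch of the kind of logging proposed here (names are illustrative, not the actual EventLogFileCompactor code): time the existing rewrite step and log the duration.

{code:java}
def rewriteWithTiming(eventLogFiles: Seq[String],
                      rewrite: Seq[String] => String,
                      logInfo: String => Unit): String = {
  val startNs = System.nanoTime()
  val compactedPath = rewrite(eventLogFiles)   // the existing rewrite logic
  val durationMs = (System.nanoTime() - startNs) / 1000000L
  logInfo(s"Finished rewriting ${eventLogFiles.size} event log file(s) " +
    s"into $compactedPath, took $durationMs ms")
  compactedPath
}
{code}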



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40497) Upgrade Scala to 2.13.11

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40497:
-
Fix Version/s: (was: 3.5.0)

> Upgrade Scala to 2.13.11
> 
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> We tested and decided to skip the following releases. This issue aims to use 
> 2.13.11.
> - 2022-09-21: v2.13.9 released 
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> - 2022-10-13: 2.13.10 released 
> [https://github.com/scala/scala/releases/tag/v2.13.10]
>  
> Scala 2.13.11 Milestone
> - https://github.com/scala/scala/milestone/100



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40497) Upgrade Scala to 2.13.11

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40497:
-
Affects Version/s: 4.0.0
   (was: 3.4.0)

> Upgrade Scala to 2.13.11
> 
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> We tested and decided to skip the following releases. This issue aims to use 
> 2.13.11.
> - 2022-09-21: v2.13.9 released 
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> - 2022-10-13: 2.13.10 released 
> [https://github.com/scala/scala/releases/tag/v2.13.10]
>  
> Scala 2.13.11 Milestone
> - https://github.com/scala/scala/milestone/100



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40497) Upgrade Scala to 2.13.11

2023-08-07 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reopened SPARK-40497:
--

Reopening this because SPARK-44690 & SPARK-44376 downgraded Scala to 2.13.8.

We need to upgrade again once we find an approach that solves the problems 
described in SPARK-44690 & SPARK-44376.

> Upgrade Scala to 2.13.11
> 
>
> Key: SPARK-40497
> URL: https://issues.apache.org/jira/browse/SPARK-40497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> We tested and decided to skip the following releases. This issue aims to use 
> 2.13.11.
> - 2022-09-21: v2.13.9 released 
> [https://github.com/scala/scala/releases/tag/v2.13.9]
> - 2022-10-13: 2.13.10 released 
> [https://github.com/scala/scala/releases/tag/v2.13.10]
>  
> Scala 2.13.11 Milestone
> - https://github.com/scala/scala/milestone/100



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait

2023-08-07 Thread Ben Hemsi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751559#comment-17751559
 ] 

Ben Hemsi commented on SPARK-44702:
---

This is my first time working on Spark. I'm happy to be assigned to the issue, 
but I don't know how to assign it to myself.

I've added the Affects Versions on which I've reproduced the issue, but I think 
the bug affects all versions.

> Cannot derive ExpressionEncoder when using types tagged by a trait
> --
>
> Key: SPARK-44702
> URL: https://issues.apache.org/jira/browse/SPARK-44702
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.2
>Reporter: Ben Hemsi
>Priority: Minor
>  Labels: spark-sql
>
> Steps to reproduce:
> {code:java}
> trait Tag
> case class Foo(x: Int)
> case class Bar(x: Foo with Tag)
> ExpressionEncoder.apply[Bar](){code}
> the ExpressionEncoder throws the following error (this was for 3.1.2):
> {code:java}
> scala.MatchError: Foo with Tag (of class 
> scala.reflect.internal.Types$RefinedType0)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49)
>  {code}
> The bug is [on this line (on 
> master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461].
> {code:java}
> val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code}
> which is an incomplete pattern match. The pattern match needs to be extended 
> to include matching on RefinedType as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait

2023-08-07 Thread Ben Hemsi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Hemsi updated SPARK-44702:
--
Priority: Minor  (was: Major)

> Cannot derive ExpressionEncoder when using types tagged by a trait
> --
>
> Key: SPARK-44702
> URL: https://issues.apache.org/jira/browse/SPARK-44702
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.2
>Reporter: Ben Hemsi
>Priority: Minor
>  Labels: spark-sql
>
> Steps to reproduce:
> {code:java}
> trait Tag
> case class Foo(x: Int)
> case class Bar(x: Foo with Tag)
> ExpressionEncoder.apply[Bar](){code}
> the ExpressionEncoder throws the following error (this was for 3.1.2):
> {code:java}
> scala.MatchError: Foo with Tag (of class 
> scala.reflect.internal.Types$RefinedType0)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49)
>  {code}
> The bug is [on this line (on 
> master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461].
> {code:java}
> val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code}
> which is an incomplete pattern match. The pattern match needs to be extended 
> to include matching on RefinedType as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751555#comment-17751555
 ] 

ASF GitHub Bot commented on SPARK-44700:


User 'monkeyboy123' has created a pull request for this issue:
https://github.com/apache/spark/pull/42376

> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: jiahong.li
>Priority: Minor
>
> _SQL_ like below:
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} includes more than 100 fields.
> If the rule OptimizeCsvJsonExprs is applied, the expression regexp_replace 
> will be invoked many times, which costs a lot of time.
>  
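
A possible session-level workaround (a sketch only, not the proposed fix; the fully qualified rule name below is assumed to match Spark 3.4.x builds) is to exclude the rule so the wrapped regexp_replace is not duplicated by per-field JSON pruning:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of a possible workaround, not the proposed fix: exclude the
// optimizer rule for this session. The fully qualified rule name is assumed
// to match the Spark 3.4.x build in use.
val spark = SparkSession.builder()
  .appName("exclude-optimize-csv-json-exprs")
  .master("local[*]")
  .getOrCreate()

spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.OptimizeCsvJsonExprs")
{code}

Excluding the rule trades the pruning benefit of OptimizeCsvJsonExprs for evaluating regexp_replace only once per row.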



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait

2023-08-07 Thread Ben Hemsi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Hemsi updated SPARK-44702:
--
Description: 
Steps to reproduce:
{code:java}
trait Tag
case class Foo(x: Int)
case class Bar(x: Foo with Tag)
ExpressionEncoder.apply[Bar](){code}
the ExpressionEncoder throws the following error (this was for 3.1.2):
{code:java}
scala.MatchError: Foo with Tag (of class 
scala.reflect.internal.Types$RefinedType0)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49)
 {code}
The bug is [on this line (on 
master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461].
{code:java}
val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code}
which is an incomplete pattern match. The pattern match needs to be extended to 
include matching on RefinedType as well.

  was:
Steps to reproduce:

 
{code:java}
trait Tag
case class Foo(x: Int)
case class Bar(x: Foo with Tag)
ExpressionEncoder.apply[Bar](){code}
the ExpressionEncoder throws the following error (this was for 3.1.2):

 
{code:java}
scala.MatchError: Foo with Tag (of class 
scala.reflect.internal.Types$RefinedType0)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49)
 {code}
The bug is [on this line (on 
master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461].
{code:java}
val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code}
which is an incomplete pattern match. The pattern match needs be extended to 
include matching on RefinedType as well.

 

 


> Cannot derive ExpressionEncoder when using types tagged by a trait
> --
>
> Key: SPARK-44702
> URL: https://issues.apache.org/jira/browse/SPARK-44702
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.2
>Reporter: Ben Hemsi
>Priority: Major
>  Labels: spark-sql
>
> Steps to reproduce:
> {code:java}
> trait Tag
> case class Foo(x: Int)
> case class Bar(x: Foo with Tag)
> ExpressionEncoder.apply[Bar](){code}
> the ExpressionEncoder throws the following error (this was for 3.1.2):
> {code:java}
> scala.MatchError: Foo with Tag (of class 
> scala.reflect.internal.Types$RefinedType0)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49)
>  {code}
> The bug is [on this line (on 
> master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461].
> {code:java}
> val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code}
> which is an incomplete pattern match. The pattern match needs to be extended 
> to include matching on RefinedType as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44702) Cannot derive ExpressionEncoder when using types tagged by a trait

2023-08-07 Thread Ben Hemsi (Jira)
Ben Hemsi created SPARK-44702:
-

 Summary: Cannot derive ExpressionEncoder when using types tagged 
by a trait
 Key: SPARK-44702
 URL: https://issues.apache.org/jira/browse/SPARK-44702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.2.4, 3.1.3
Reporter: Ben Hemsi


Steps to reproduce:

 
{code:java}
trait Tag
case class Foo(x: Int)
case class Bar(x: Foo with Tag)
ExpressionEncoder.apply[Bar](){code}
the ExpressionEncoder throws the following error (this was for 3.1.2):

 
{code:java}
scala.MatchError: Foo with Tag (of class 
scala.reflect.internal.Types$RefinedType0)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:931)
  at 
org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:928)
  at 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:49)
 {code}
The bug is [on this line (on 
master)|https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L461].
{code:java}
val TypeRef(_, _, actualTypeArgs) = dealiasedTpe {code}
which is an incomplete pattern match. The pattern match needs to be extended to 
include matching on RefinedType as well.
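
A self-contained sketch of the kind of fallback described above, using scala.reflect directly rather than Spark's internal ScalaReflection helpers (illustrative only; {{typeArgsOf}} is a made-up name, not the actual patch):

{code:scala}
import scala.reflect.runtime.universe._

// Illustrative only: tolerate refined types such as `Foo with Tag` by falling
// back to the first parent that is a plain TypeRef, instead of failing the
// pattern match outright.
def typeArgsOf(tpe: Type): List[Type] = tpe.dealias match {
  case TypeRef(_, _, args) => args
  case RefinedType(parents, _) =>
    parents.collectFirst { case TypeRef(_, _, args) => args }.getOrElse(Nil)
  case _ => Nil
}

trait Tag
case class Foo(x: Int)

typeArgsOf(typeOf[List[Int]])    // List(Int)
typeArgsOf(typeOf[Foo with Tag]) // List() rather than a scala.MatchError
{code}

Pasting the snippet into a Scala REPL shows that the refined type no longer triggers a scala.MatchError; the real fix would extend the corresponding match in ScalaReflection.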

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44669) Parquet/ORC files written using Hive Serde should have a file extension

2023-08-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751548#comment-17751548
 ] 

ASF GitHub Bot commented on SPARK-44669:


User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/42336

> Parquet/ORC files written using Hive Serde should have a file extension
> 
>
> Key: SPARK-44669
> URL: https://issues.apache.org/jira/browse/SPARK-44669
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-44700:
---
Description: 
_SQL_ like below:

select tmp.* 
 from
 (select
        device_id, ads_id, 
        from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', 
'"user_device_'), ${device_schema}) as tmp
        from input )

${device_schema} includes more than 100 fields.

If the rule OptimizeCsvJsonExprs is applied, the expression regexp_replace 
will be invoked many times, which costs a lot of time.

 

  was:
_SQL_ like below:

select tmp.* 
 from
 (select
        device_id, ads_id, 
        from_json(regexp_replace(device_personas, '(?<=(\{|,))"device_', 
'"user_device_'), ${device_schema}) as tmp
        from input )

${device_schema} has more than 100 fields.



if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
will be invoked many times, that costs so much time.

 


> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: jiahong.li
>Priority: Minor
>
> _SQL_ like below:
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} includes more than 100 fields.
> If the rule OptimizeCsvJsonExprs is applied, the expression regexp_replace 
> will be invoked many times, which costs a lot of time.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-44700:
---
Description: 
_SQL_ like below:

select tmp.* 
 from
 (select
        device_id, ads_id, 
        from_json(regexp_replace(device_personas, '(?<=(\{|,))"device_', 
'"user_device_'), ${device_schema}) as tmp
        from input )

${device_schema} has more than 100 fields.



if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
will be invoked many times, that costs so much time.

 

  was:
_SQL_ like below:

```

select tmp.* 
 from
 (select
        device_id, ads_id, 
        from_json(regexp_replace(device_personas, '(?<=({|,))"device_', 
'"user_device_'), ${device_schema}) as tmp
        from input )

${device_schema} has more than 100 fields
```

if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
will be invoked many times, that costs so much time.

 


> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: jiahong.li
>Priority: Minor
>
> _SQL_ like below:
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=(\{|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} has more than 100 fields.
> if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
> will be invoked many times, that costs so much time.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-44700:
---
Description: 
_SQL_ like below:

```

select tmp.* 
 from
 (select
        device_id, ads_id, 
        from_json(regexp_replace(device_personas, '(?<=({|,))"device_', 
'"user_device_'), ${device_schema}) as tmp
        from input )

${device_schema} has more than 100 fields
```

if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
will be invoked many times, that costs so much time.

 

  was:
_SQL_ like below:

```

select tmp.* 
 from
 (select
        device_id, ads_id, 
        from_json(regexp_replace(device_personas, '(?<=(\\\{|,))"device_', 
'"user_device_'), ${device_schema}) as tmp
        from input )

${device_schema} has more than 100 fields
```

if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
will be invoked many times, that costs so much time

 


> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: jiahong.li
>Priority: Minor
>
> _SQL_ like below:
> ```
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=({|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} has more than 100 fields
> ```
> if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
> will be invoked many times, that costs so much time.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-44700:
---
Description: 
_SQL_ like below:

```

select tmp.* 
 from
 (select
        device_id, ads_id, 
        from_json(regexp_replace(device_personas, '(?<=(\\\{|,))"device_', 
'"user_device_'), ${device_schema}) as tmp
        from input )

${device_schema} has more than 100 fields
```

if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
will be invoked many times, that costs so much time

 

> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: jiahong.li
>Priority: Minor
>
> _SQL_ like below:
> ```
> select tmp.* 
>  from
>  (select
>         device_id, ads_id, 
>         from_json(regexp_replace(device_personas, '(?<=(\\\{|,))"device_', 
> '"user_device_'), ${device_schema}) as tmp
>         from input )
> ${device_schema} has more than 100 fields
> ```
> if Rule: OptimizeCsvJsonExprs  been applied, the expression, regexp_replace, 
> will be invoked many times, that costs so much time
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44701) Skip ClassificationTestsOnConnect when torch is not installed

2023-08-07 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44701:
-

 Summary: Skip ClassificationTestsOnConnect when torch is not 
installed
 Key: SPARK-44701
 URL: https://issues.apache.org/jira/browse/SPARK-44701
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-44700:
---
Component/s: SQL

> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: jiahong.li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-44700:
---
Component/s: (was: Spark Core)

> Rule OptimizeCsvJsonExprs should not be applied to expression like 
> from_json(regexp_replace)
> 
>
> Key: SPARK-44700
> URL: https://issues.apache.org/jira/browse/SPARK-44700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: jiahong.li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)

2023-08-07 Thread jiahong.li (Jira)
jiahong.li created SPARK-44700:
--

 Summary: Rule OptimizeCsvJsonExprs should not be applied to 
expression like from_json(regexp_replace)
 Key: SPARK-44700
 URL: https://issues.apache.org/jira/browse/SPARK-44700
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1, 3.4.0
Reporter: jiahong.li






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44698) Create table like other table should also copy table stats.

2023-08-07 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu resolved SPARK-44698.

Resolution: Not A Problem

Sorry, I misunderstood: CREATE TABLE ... LIKE doesn't actually copy data, so there are no statistics to carry over.

> Create table like other table should also copy table stats.
> ---
>
> Key: SPARK-44698
> URL: https://issues.apache.org/jira/browse/SPARK-44698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 4.0.0
>Reporter: Qi Zhu
>Priority: Major
>
> For example:
> describe table extended tbl;
> col0                    int
> col1                    int
> col2                    int
> col3                    int
> Detailed Table Information
> Catalog                 spark_catalog
> Database                default
> Table                   tbl
> Owner                   zhuqi
> Created Time            Mon Aug 07 14:02:30 CST 2023
> Last Access             UNKNOWN
> Created By              Spark 4.0.0-SNAPSHOT
> Type                    MANAGED
> Provider                hive
> Table Properties        [transient_lastDdlTime=1691388473]
> Statistics              30 bytes
> Location                
> [file:/Users/zhuqi/spark/spark/spark-warehouse/tbl|file:///Users/zhuqi/spark/spark/spark-warehouse/tbl]
> Serde Library           org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat             org.apache.hadoop.mapred.TextInputFormat
> OutputFormat            
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties      [serialization.format=1]
> Partition Provider      Catalog
> Time taken: 0.032 seconds, Fetched 23 row(s)
> create table tbl2 like tbl;
> 23/08/07 14:14:07 WARN HiveMetaStore: Location: 
> [file:/Users/zhuqi/spark/spark/spark-warehouse/tbl2|file:///Users/zhuqi/spark/spark/spark-warehouse/tbl2]
>  specified for non-external table:tbl2
> Time taken: 0.098 seconds
> spark-sql (default)> describe table extended tbl2;
> col0                    int
> col1                    int
> col2                    int
> col3                    int
> Detailed Table Information
> Catalog                 spark_catalog
> Database                default
> Table                   tbl2
> Owner                   zhuqi
> Created Time            Mon Aug 07 14:14:07 CST 2023
> Last Access             UNKNOWN
> Created By              Spark 4.0.0-SNAPSHOT
> Type                    MANAGED
> Provider                hive
> Table Properties        [transient_lastDdlTime=1691388847]
> Location                
> [file:/Users/zhuqi/spark/spark/spark-warehouse/tbl2|file:///Users/zhuqi/spark/spark/spark-warehouse/tbl2]
> Serde Library           org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat             org.apache.hadoop.mapred.TextInputFormat
> OutputFormat            
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties      [serialization.format=1]
> Partition Provider      Catalog
> Time taken: 0.03 seconds, Fetched 22 row(s)
> The table stats are missing.
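
If statistics are wanted on the copy, they can be recomputed once the table has data; a sketch under the assumption of a Hive-enabled session and the table names from the example above:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only (not part of the issue): CREATE TABLE ... LIKE copies metadata,
// not data, so the new table has no statistics to inherit. After loading data,
// stats can be recomputed explicitly.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE tbl2 LIKE tbl")
spark.sql("INSERT INTO tbl2 SELECT * FROM tbl")    // only if the data itself is wanted
spark.sql("ANALYZE TABLE tbl2 COMPUTE STATISTICS") // repopulates the Statistics row shown by DESCRIBE EXTENDED
{code}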



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


