[jira] [Resolved] (SPARK-48371) Upgrade to Parquet 1.14
[ https://issues.apache.org/jira/browse/SPARK-48371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski resolved SPARK-48371. --- Resolution: Duplicate > Upgrade to Parquet 1.14 > --- > > Key: SPARK-48371 > URL: https://issues.apache.org/jira/browse/SPARK-48371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Willi Raschkowski >Priority: Major > > There's a bug in Parquet > [(PARQUET-2454)|https://issues.apache.org/jira/browse/PARQUET-2454] where > Parquet in Spark occasionally writes out truncated files with bytes missing > at the end. > The fix was released in Parquet 1.14.0. [See > changelog.|https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48371) Upgrade to Parquet 1.14
[ https://issues.apache.org/jira/browse/SPARK-48371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848166#comment-17848166 ] Willi Raschkowski commented on SPARK-48371: --- Apologies, I noticed we already have SPARK-48177. > Upgrade to Parquet 1.14 > --- > > Key: SPARK-48371 > URL: https://issues.apache.org/jira/browse/SPARK-48371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Willi Raschkowski >Priority: Major > > There's a bug in Parquet > [(PARQUET-2454)|https://issues.apache.org/jira/browse/PARQUET-2454] where > Parquet in Spark occasionally writes out truncated files with bytes missing > at the end. > The fix was released in Parquet 1.14.0. [See > changelog.|https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48371) Upgrade to Parquet 1.14
Willi Raschkowski created SPARK-48371: - Summary: Upgrade to Parquet 1.14 Key: SPARK-48371 URL: https://issues.apache.org/jira/browse/SPARK-48371 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.1 Reporter: Willi Raschkowski There's a bug in Parquet [(PARQUET-2454)|https://issues.apache.org/jira/browse/PARQUET-2454] where Parquet in Spark occasionally writes out truncated files with bytes missing at the end. The fix was released in Parquet 1.14.0. [See changelog.|https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
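To check which parquet-mr version a given Spark build actually bundles (and so whether PARQUET-2454 applies), a quick spark-shell probe along these lines can help. This is only a hedged convenience sketch; {{getImplementationVersion}} may come back null if the jar manifest omits the version.

{code:title=spark-shell}
// Report the parquet-mr version on the classpath; prints "unknown" if the
// jar manifest carries no Implementation-Version entry.
val parquetVersion = Option(
  classOf[org.apache.parquet.hadoop.ParquetFileWriter].getPackage.getImplementationVersion
).getOrElse("unknown")

println(s"parquet-mr version on classpath: $parquetVersion")
{code}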
[jira] [Commented] (SPARK-47307) Spark 3.3 produces invalid base64
[ https://issues.apache.org/jira/browse/SPARK-47307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824115#comment-17824115 ] Willi Raschkowski commented on SPARK-47307: --- The behavior change is as follows: * Spark 3.2, [here|https://github.com/apache/spark/blob/e428fe902bb1f12cea973de7fe4b885ae69fd6ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2206], was using Apache's encoder like this: {{{}CommonsBase64.encodeBase64(bytes.asInstanceOf[Array[Byte]]){}}}. * That {{encodeBase64}} call does _not_ chunk [its output|https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64(byte%5B%5D,boolean,boolean,int)]. * Falsely assuming that Apache's encoder would follow the RC2045 / MIME spec, Spark 3.3 started using [Java's MIME encoder|https://github.com/apache/spark/blob/f74867bddfbcdd4d08076db36851e88b15e66556/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2431]. The MIME encoder [follows the RFC2045 spec and _does chunk_|https://datatracker.ietf.org/doc/html/rfc2045#section-6.8:~:text=76%0A%20%20%20%20%20%20%20%20%20%20characters%20long.]. * That chunking is what introduced those {{\r\n}} separators. > Spark 3.3 produces invalid base64 > - > > Key: SPARK-47307 > URL: https://issues.apache.org/jira/browse/SPARK-47307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Blocker > Labels: correctness > > SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} > (which is fine but shouldn't happen between minor version). > {code:title=Spark 3.2} > >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] > 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYQ==' > {code} > Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). > {code:title=Spark 3.3} > >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] > 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh\r\nYQ==' > {code} > The former decodes fine with the {{base64}} on my machine but the latter does > not: > {code} > $ pbpaste | base64 --decode > aa% > $ pbpaste | base64 --decode > base64: stdin: (null): error decoding base64 input stream > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
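The chunking difference is easy to reproduce outside Spark. Below is a minimal spark-shell sketch, assuming commons-codec is on the classpath (Spark bundles it): Apache's {{encodeBase64}} returns one unbroken string, while Java's MIME encoder inserts {{\r\n}} every 76 characters per RFC 2045.

{code:title=spark-shell}
import java.nio.charset.StandardCharsets
import java.util.{Base64 => JavaBase64}
import org.apache.commons.codec.binary.{Base64 => CommonsBase64}

val bytes = ("a" * 58).getBytes(StandardCharsets.UTF_8)

// Spark 3.2 path: Apache Commons encodeBase64, no chunking.
val unchunked = new String(CommonsBase64.encodeBase64(bytes), StandardCharsets.UTF_8)

// Spark 3.3 path: java.util.Base64 MIME encoder, chunked at 76 chars with \r\n.
val chunked = JavaBase64.getMimeEncoder.encodeToString(bytes)

unchunked.contains("\r\n")  // false
chunked.contains("\r\n")    // true
{code}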
[jira] [Updated] (SPARK-47307) Spark 3.3 breaks base64
[ https://issues.apache.org/jira/browse/SPARK-47307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-47307: -- Description: SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} (which is fine but shouldn't happen between minor version). {code:title=Spark 3.2} >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYQ==' {code} Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). {code:title=Spark 3.3} >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh\r\nYQ==' {code} The former decodes fine with the {{base64}} on my machine but the latter does not: {code} $ pbpaste | base64 --decode aa% $ pbpaste | base64 --decode base64: stdin: (null): error decoding base64 input stream {code} was: SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} (which is fine but shouldn't happen between minor version). {code:title=Spark 3.2} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo metus ut dui. Morbi purus lectus, accumsan vitae sem vitae, faucibus aliquet quam. Donec euismod, nulla a por ...: ta hendrerit, lorem magna vestibulum nunc, et eleifend quam metus quis purus. ...: ...: Praesent id velit scelerisque, varius eros ac, cursus quam. Duis mollis facilisis ante a dictum. Nunc nisl sem, fermentum non sagittis non, convallis nec lectus. Praesent nec nulla sed velit interdum tristique sit amet non nisl. Pellentesque rhoncus libero urna, eget condimentum orci tristique in. Donec a felis eu nisl laoreet efficitur. Integer velit ju ...: sto, elementum a faucibus ac, fringilla ac nibh. 
...: """ In [2]: spark.sql(f"""SELECT base64('{lorem}') AS base64""").collect()[0][0] Out[2]: 'CkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0LCBjb25zZWN0ZXR1ciBhZGlwaXNjaW5nIGVsaXQuIE51bmMgYWMgbGFvcmVldCBtZXR1cy4gQ3VyYWJpdHVyIHNvbGxpY2l0dWRpbiBtYWduYSBhYyBsYWNpbmlhIG9ybmFyZS4gUGVsbGVudGVzcXVlIHNlbXBlciBlbGl0IG51bmMsIHZlc3RpYnVsdW0gdWx0cmljaWVzIGVsaXQgYmliZW5kdW0gc2VkLiBQcmFlc2VudCB2ZWhpY3VsYSBzb2RhbGVzIG9kaW8sIHRpbmNpZHVudCBsYW9yZWV0IGRpYW0gbGFvcmVldCBub24uIE1hdXJpcyBjb25kaW1lbnR1bSBsYWNpbmlhIGxhb3JlZXQuIE1hdXJpcyB1bHRyaWNlcyB1cm5hIHV0IHNhcGllbiBkaWN0dW0gY29tbW9kbyBmYXVjaWJ1cyBuZWMgbmlzbC4gTnVsbGEgbWF0dGlzIHRpbmNpZHVudCBvcmNpIGVnZXQgc2VtcGVyLiBFdGlhbSBkaWduaXNzaW0gZmluaWJ1cyBtaSBldCBsYWNpbmlhLiBDdXJhYml0dXIgdml0YWUgc2VtIGNvbW1vZG8sIGV1aXNtb2QgbmlzbCBhdCwgbW9sZXN0aWUgdG9ydG9yLiBRdWlzcXVlIG9ybmFyZSwgdG9ydG9yIGEgdnVscHV0YXRlIG1vbGVzdGllLCBhdWd1ZSBsZWN0dXMgYmxhbmRpdCBlcmF0LCBuZWMgZWZmaWNpdHVyIGp1c3RvIG1ldHVzIHV0IGR1aS4gTW9yYmkgcHVydXMgbGVjdHVzLCBhY2N1bXNhbiB2aXRhZSBzZW0gdml0YWUsIGZhdWNpYnVzIGFsaXF1ZXQgcXVhbS4gRG9uZWMgZXVpc21vZCwgbnVsbGEgYSBwb3J0YSBoZW5kcmVyaXQsIGxvcmVtIG1hZ25hIHZlc3RpYnVsdW0gbnVuYywgZXQgZWxlaWZlbmQgcXVhbSBtZXR1cyBxdWlzIHB1cnVzLgoKUHJhZXNlbnQgaWQgdmVsaXQgc2NlbGVyaXNxdWUsIHZhcml1cyBlcm9zIGFjLCBjdXJzdXMgcXVhbS4gRHVpcyBtb2xsaXMgZmFjaWxpc2lzIGFudGUgYSBkaWN0dW0uIE51bmMgbmlzbCBzZW0sIGZlcm1lbnR1bSBub24gc2FnaXR0aXMgbm9uLCBjb252YWxsaXMgbmVjIGxlY3R1cy4gUHJhZXNlbnQgbmVjIG51bGxhIHNlZCB2ZWxpdCBpbnRlcmR1bSB0cmlzdGlxdWUgc2l0IGFtZXQgbm9uIG5pc2wuIFBlbGxlbnRlc3F1ZSByaG9uY3VzIGxpYmVybyB1cm5hLCBlZ2V0IGNvbmRpbWVudHVtIG9yY2kgdHJpc3RpcXVlIGluLiBEb25lYyBhIGZlbGlzIGV1IG5pc2wgbGFvcmVldCBlZmZpY2l0dXIuIEludGVnZXIgdmVsaXQganVzdG8sIGVsZW1lbnR1bSBhIGZhdWNpYnVzIGFjLCBmcmluZ2lsbGEgYWMgbmliaC4K' {code} Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). {code:title=Spark 3.3} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo met
[jira] [Created] (SPARK-47307) Spark 3.3 breaks base64
Willi Raschkowski created SPARK-47307: - Summary: Spark 3.3 breaks base64 Key: SPARK-47307 URL: https://issues.apache.org/jira/browse/SPARK-47307 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Willi Raschkowski SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} (which is fine but shouldn't happen between minor version). {code:title=Spark 3.2} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo metus ut dui. Morbi purus lectus, accumsan vitae sem vitae, faucibus aliquet quam. Donec euismod, nulla a por ...: ta hendrerit, lorem magna vestibulum nunc, et eleifend quam metus quis purus. ...: ...: Praesent id velit scelerisque, varius eros ac, cursus quam. Duis mollis facilisis ante a dictum. Nunc nisl sem, fermentum non sagittis non, convallis nec lectus. Praesent nec nulla sed velit interdum tristique sit amet non nisl. Pellentesque rhoncus libero urna, eget condimentum orci tristique in. Donec a felis eu nisl laoreet efficitur. Integer velit ju ...: sto, elementum a faucibus ac, fringilla ac nibh. ...: """ In [2]: spark.sql(f"""SELECT base64('{lorem}') AS base64""").collect()[0][0] Out[2]: 'CkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0LCBjb25zZWN0ZXR1ciBhZGlwaXNjaW5nIGVsaXQuIE51bmMgYWMgbGFvcmVldCBtZXR1cy4gQ3VyYWJpdHVyIHNvbGxpY2l0dWRpbiBtYWduYSBhYyBsYWNpbmlhIG9ybmFyZS4gUGVsbGVudGVzcXVlIHNlbXBlciBlbGl0IG51bmMsIHZlc3RpYnVsdW0gdWx0cmljaWVzIGVsaXQgYmliZW5kdW0gc2VkLiBQcmFlc2VudCB2ZWhpY3VsYSBzb2RhbGVzIG9kaW8sIHRpbmNpZHVudCBsYW9yZWV0IGRpYW0gbGFvcmVldCBub24uIE1hdXJpcyBjb25kaW1lbnR1bSBsYWNpbmlhIGxhb3JlZXQuIE1hdXJpcyB1bHRyaWNlcyB1cm5hIHV0IHNhcGllbiBkaWN0dW0gY29tbW9kbyBmYXVjaWJ1cyBuZWMgbmlzbC4gTnVsbGEgbWF0dGlzIHRpbmNpZHVudCBvcmNpIGVnZXQgc2VtcGVyLiBFdGlhbSBkaWduaXNzaW0gZmluaWJ1cyBtaSBldCBsYWNpbmlhLiBDdXJhYml0dXIgdml0YWUgc2VtIGNvbW1vZG8sIGV1aXNtb2QgbmlzbCBhdCwgbW9sZXN0aWUgdG9ydG9yLiBRdWlzcXVlIG9ybmFyZSwgdG9ydG9yIGEgdnVscHV0YXRlIG1vbGVzdGllLCBhdWd1ZSBsZWN0dXMgYmxhbmRpdCBlcmF0LCBuZWMgZWZmaWNpdHVyIGp1c3RvIG1ldHVzIHV0IGR1aS4gTW9yYmkgcHVydXMgbGVjdHVzLCBhY2N1bXNhbiB2aXRhZSBzZW0gdml0YWUsIGZhdWNpYnVzIGFsaXF1ZXQgcXVhbS4gRG9uZWMgZXVpc21vZCwgbnVsbGEgYSBwb3J0YSBoZW5kcmVyaXQsIGxvcmVtIG1hZ25hIHZlc3RpYnVsdW0gbnVuYywgZXQgZWxlaWZlbmQgcXVhbSBtZXR1cyBxdWlzIHB1cnVzLgoKUHJhZXNlbnQgaWQgdmVsaXQgc2NlbGVyaXNxdWUsIHZhcml1cyBlcm9zIGFjLCBjdXJzdXMgcXVhbS4gRHVpcyBtb2xsaXMgZmFjaWxpc2lzIGFudGUgYSBkaWN0dW0uIE51bmMgbmlzbCBzZW0sIGZlcm1lbnR1bSBub24gc2FnaXR0aXMgbm9uLCBjb252YWxsaXMgbmVjIGxlY3R1cy4gUHJhZXNlbnQgbmVjIG51bGxhIHNlZCB2ZWxpdCBpbnRlcmR1bSB0cmlzdGlxdWUgc2l0IGFtZXQgbm9uIG5pc2wuIFBlbGxlbnRlc3F1ZSByaG9uY3VzIGxpYmVybyB1cm5hLCBlZ2V0IGNvbmRpbWVudHVtIG9yY2kgdHJpc3RpcXVlIGluLiBEb25lYyBhIGZlbGlzIGV1IG5pc2wgbGFvcmVldCBlZmZpY2l0dXIuIEludGVnZXIgdmVsaXQganVzdG8sIGVsZW1lbnR1bSBhIGZhdWNpYnVzIGFjLCBmcmluZ2lsbGEgYWMgbmliaC4K' {code} Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). {code:title=Spark 3.3} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. 
Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo metus ut dui. Morbi purus lectus, accumsan vitae sem vitae, faucibus aliquet quam. Donec euismod, nulla a por ...: ta hendrerit, lorem magna vestibulum nunc, et eleifend quam metus quis purus. ...: ...: Praesent id velit scelerisque, varius eros ac, cursus quam. Duis mollis facilisis ante a dictum. Nunc nisl sem, fermentum non sagittis non, convallis nec lectus. Praesent nec nulla sed velit interdum tristique sit amet non nisl. Pellentesque rhoncus libero urna, eget condimentum orci tristique in. Donec a felis eu nisl laoreet efficitur. Integer velit ju ...: sto, elementum a faucibus ac, fringilla ac nibh. ...: """ In [2]: spark.sql(f"""SELECT base64('{lorem}') AS base64""").collect()[0][0]
[jira] [Comment Edited] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811850#comment-17811850 ] Willi Raschkowski edited comment on SPARK-46893 at 1/29/24 12:15 PM: - cc [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. was (Author: raschkowski): [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Commented] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811850#comment-17811850 ] Willi Raschkowski commented on SPARK-46893: --- [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Description: Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} handlers) in the UI job and stage descriptions. The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Description: Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} handlers) in the UI job and stage descriptions. The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Attachment: Screenshot 2024-01-29 at 09.06.34.png > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Summary: Remove inline scripts from UI descriptions (was: Sanitize UI descriptions from inline scripts) > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Created] (SPARK-46893) Sanitize UI descriptions from inline scripts
Willi Raschkowski created SPARK-46893: - Summary: Sanitize UI descriptions from inline scripts Key: SPARK-46893 URL: https://issues.apache.org/jira/browse/SPARK-46893 Project: Spark Issue Type: Bug Components: UI, Web UI Affects Versions: 3.4.1 Reporter: Willi Raschkowski Attachments: Screen Recording 2024-01-28 at 17.51.47.mov Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} handlers) in the UI job and stage descriptions. The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Sanitize UI descriptions from inline scripts
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Attachment: Screen Recording 2024-01-28 at 17.51.47.mov > Sanitize UI descriptions from inline scripts > > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Commented] (SPARK-44767) Plugin API for PySpark and SparkR workers
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788360#comment-17788360 ] Willi Raschkowski commented on SPARK-44767: --- [~gurwls223], curious what you think about this proposal. I know you're leaning towards dynamic environment selection for Spark Connect [(apache/spark#41215)|https://github.com/apache/spark/pull/41215] instead of relying on a single environment per Spark application or per host. At Palantir, we use conda-pack based environments with {{spark.archives}}. But that wasn't sufficient to make native library dependencies work. Internally, we implemented a {{ProcessBuilder}} plugin (using the [proposed API|https://github.com/apache/spark/pull/42440]). Among other things we use it to append the environment's {{bin/}} to the process' {{PATH}} variable or to discover Python module and non-Python binary locations outside the packaged environment. > Plugin API for PySpark and SparkR workers > - > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. > But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
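To make that concrete, the kind of hook being described could look roughly like the sketch below. This is a hypothetical illustration only, not the API proposed in apache/spark#42440; the trait name and signature are made up. The point is just that handing the worker's {{ProcessBuilder}} to a plugin lets a deployment prepend the unpacked environment's {{bin/}} to {{PATH}} before the Python or R worker is launched.

{code}
// Hypothetical sketch, not the API from apache/spark#42440.
import java.io.File

trait WorkerProcessPlugin {
  def configure(builder: ProcessBuilder): ProcessBuilder
}

class PrependEnvironmentBin(environmentDir: File) extends WorkerProcessPlugin {
  override def configure(builder: ProcessBuilder): ProcessBuilder = {
    val env = builder.environment()
    val binDir = new File(environmentDir, "bin").getAbsolutePath
    // Put the environment's bin/ ahead of whatever PATH the host pre-configured.
    env.put("PATH", binDir + File.pathSeparator + Option(env.get("PATH")).getOrElse(""))
    builder
  }
}
{code}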
[jira] [Commented] (SPARK-44767) Plugin API for PySpark and SparkR workers
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752926#comment-17752926 ] Willi Raschkowski commented on SPARK-44767: --- I put up a proposal implementation here: https://github.com/apache/spark/pull/42440 > Plugin API for PySpark and SparkR workers > - > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. > But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44767) Plugin API for PySpark and SparkR workers
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-44767: -- Summary: Plugin API for PySpark and SparkR workers (was: Plugin API for PySpark and SparkR subprocesses) > Plugin API for PySpark and SparkR workers > - > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. > But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44767) Plugin API for PySpark and SparkR subprocesses
Willi Raschkowski created SPARK-44767: - Summary: Plugin API for PySpark and SparkR subprocesses Key: SPARK-44767 URL: https://issues.apache.org/jira/browse/SPARK-44767 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.4.1 Reporter: Willi Raschkowski An API to customize Python and R workers allows for extensibility beyond what can be expressed via static configs and environment variables like, e.g., {{spark.pyspark.python}}. A use case we had for this is overriding {{PATH}} when using {{spark.archives}} with, say, conda-pack (as documented [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). Some packages rely on binaries. And if we want to use those packages in Spark, we need to include their binaries in the {{PATH}}. But we can't set the {{PATH}} via some config because 1) the environment with its binaries may be at a dynamic location (archives are unpacked on the driver [into a directory with random name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), and 2) we may not want to override the {{PATH}} that's pre-configured on the hosts. Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
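The "dynamic location" problem in the description is visible directly through {{SparkFiles}}: the directory that {{spark.archives}} gets unpacked into is only known at runtime, so its {{bin/}} cannot be put on {{PATH}} through a static config. A small hedged sketch (the archive alias below is just an example):

{code:title=spark-shell}
import org.apache.spark.SparkFiles

// Root directory for files/archives added to this application; the name is
// randomized per application, so it cannot be baked into PATH ahead of time.
println(SparkFiles.getRootDirectory())

// Resolved path of a hypothetical archive added as spark.archives=environment.tar.gz#environment
println(SparkFiles.get("environment"))
{code}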
[jira] [Updated] (SPARK-44767) Plugin API for PySpark and SparkR subprocesses
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-44767: -- Description: An API to customize Python and R workers allows for extensibility beyond what can be expressed via static configs and environment variables like, e.g., {{spark.pyspark.python}}. A use case for this is overriding {{PATH}} when using {{spark.archives}} with, say, conda-pack (as documented [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). Some packages rely on binaries. And if we want to use those packages in Spark, we need to include their binaries in the {{PATH}}. But we can't set the {{PATH}} via some config because 1) the environment with its binaries may be at a dynamic location (archives are unpacked on the driver [into a directory with random name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), and 2) we may not want to override the {{PATH}} that's pre-configured on the hosts. Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream. was: An API to customize Python and R workers allows for extensibility beyond what can be expressed via static configs and environment variables like, e.g., {{spark.pyspark.python}}. A use case we had for this is overriding {{PATH}} when using {{spark.archives}} with, say, conda-pack (as documented [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). Some packages rely on binaries. And if we want to use those packages in Spark, we need to include their binaries in the {{PATH}}. But we can't set the {{PATH}} via some config because 1) the environment with its binaries may be at a dynamic location (archives are unpacked on the driver [into a directory with random name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), and 2) we may not want to override the {{PATH}} that's pre-configured on the hosts. Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream. > Plugin API for PySpark and SparkR subprocesses > -- > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. 
> But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712395#comment-17712395 ] Willi Raschkowski commented on SPARK-43142: --- https://github.com/apache/spark/pull/40794 > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712383#comment-17712383 ] Willi Raschkowski commented on SPARK-43142: --- The solution I'd propose is to have {{DslAttr.attr}} return the attribute it's wrapping instead of creating a new attribute. > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712383#comment-17712383 ] Willi Raschkowski edited comment on SPARK-43142 at 4/14/23 1:18 PM: The solution I'd propose is to have {{DslAttr.attr}} return the attribute it's wrapping instead of creating a new attribute. I'll put up a PR. was (Author: raschkowski): The solution I'd propose is to have {{DslAttr.attr}} return the attribute it's wrapping instead of creating a new attribute. > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712382#comment-17712382 ] Willi Raschkowski commented on SPARK-43142: --- Here's what's happening: {{ImplicitOperators}} methods like {{asc}} rely on a call to {{expr}} [(Github)|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L149]. The {{UnresolvedAttribute}} returned by {{.attr}} is implicitly converted to {{DslAttr}}. But {{DslAttr}} does not implement {{expr}} by returning the attribute it's already wrapping. Instead, it only implements how to convert the attribute it's wrapping to a string name [(Github)|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L273-L275]. Returning an attribute for an implicitly wrapped attribute is implemented on the super class {{ImplicitAttribute}} by creating a new {{UnresolvedAttribute}} on the string name return by {{DslAttr}} (the method call {{s}}, [Github|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L278-L280]). The problem is that this string name returned by {{DslAttr}} no longer has the quotes and thus the new {{UnresolvedAttribute}} parses an unquoted identifier. {code} scala> "`col/slash`".attr.name res1: String = col/slash {code} > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
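Stripped of the catalyst specifics, the fix direction proposed above can be sketched with simplified stand-in types. These are hypothetical classes, not the real {{dsl}} package internals; the point is only that the implicit wrapper should hand back the attribute it already holds instead of rebuilding one from the unquoted name string.

{code}
// Hypothetical stand-ins, not the real catalyst classes.
case class Attr(nameParts: Seq[String]) {
  // Like UnresolvedAttribute.name, any quoting is gone once the attribute exists.
  def name: String = nameParts.mkString(".")
}

trait ImplicitAttr {
  def s: String
  // Today's behavior: discard the original attribute and rebuild one from the
  // bare name string (in real catalyst this goes back through parsing, which is
  // where the ParseException on "slashed/col" comes from).
  def attr: Attr = Attr(s.split("\\.").toSeq)
}

// Proposed direction: keep the attribute the wrapper was handed in the first place.
class FixedDslAttr(wrapped: Attr) extends ImplicitAttr {
  def s: String = wrapped.name
  override def attr: Attr = wrapped
}
{code}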
[jira] [Created] (SPARK-43142) DSL expressions fail on attribute with special characters
Willi Raschkowski created SPARK-43142: - Summary: DSL expressions fail on attribute with special characters Key: SPARK-43142 URL: https://issues.apache.org/jira/browse/SPARK-43142 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Willi Raschkowski Expressions on implicitly converted attributes fail if the attributes have names containing special characters. They fail even if the attributes are backtick-quoted: {code:java} scala> import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.expressions._ scala> "`slashed/col`".attr res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 'slashed/col scala> "`slashed/col`".attr.asc org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) == SQL == slashed/col ---^^^ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709352#comment-17709352 ] Willi Raschkowski commented on SPARK-35324: --- I understand this better now: * When calling {{SQLConf.get}} on executors, the configs are read from the local properties on the {{{}TaskContext{}}}. The local properties are populated driver-side when scheduling the job, using the properties found in {{{}sparkContext.localProperties{}}}. * For RDD actions like {{{}rdd.count{}}}, nothing moves driver-side SQL configs into the SparkContext's local properties. * For datasets, all actions incl. {{{}dataset.count{}}} are wrapped in a {{withAction}} call [(e.g.)|https://github.com/apache/spark/blob/v3.3.2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3160]. * {{withAction}} wraps the action in {{SQLExecution.withNewExecutionId}}, which in turn wraps it in {{SQLExecution.withSQLConfPropagated}}. This latter method copies SQL configs into the SparkContext's local properties. So in summary, all actions on datasets get wrapped in {{withSQLConfPropagated}} while actions on RDDs aren't. That's why {{df.count}} works but {{df.rdd.count}} doesn't. With {{count}} the answer is to just use {{Dataset.count}}. But, e.g., {{df.toLocalIterator}} has no alternative. To fix this, Spark would have to always copy configs into local properties (e.g. in {{submitJob}}). If maintainers like that, I'll put up a PR. If not, feel free to close. In the meantime, my work-around is to call {{SQLExecution.withSQLConfPropagated}} myself. {code} scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") scala> spark.read.schema("date date").option("dateFormat", "MM/dd/yy").csv(Seq("2/6/18").toDS()).toLocalIterator.next 23/04/06 13:58:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] ... scala> SQLExecution.withSQLConfPropagated(spark) { | spark.read.schema("date date").option("dateFormat", "MM/dd/yy").csv(Seq("2/6/18").toDS()).toLocalIterator.next | } res2: org.apache.spark.sql.Row = [2018-02-06] {code} > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.legacy.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on the > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.legacy.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.rdd.collect}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. 
> For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spar
[jira] [Created] (SPARK-42373) Remove unused blank line removal from CSVExprUtils
Willi Raschkowski created SPARK-42373: - Summary: Remove unused blank line removal from CSVExprUtils Key: SPARK-42373 URL: https://issues.apache.org/jira/browse/SPARK-42373 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: Willi Raschkowski The non-multiline CSV read codepath contains references to removal of blank lines throughout. This is not necessary as blank lines are removed by the parser. Furthermore, it causes confusion, indicating that blank lines are removed at this point when instead they are already omitted from the data. The multiline code-path does not explicitly remove blank lines leading to what looks like disparity in behavior between the two. The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need to explicitly skip lines, and this should be respected in {{CSVUtils}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42359) Support row skipping when reading CSV files
[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684689#comment-17684689 ] Willi Raschkowski commented on SPARK-42359: --- This repeats SPARK-26406 but it's worth reconsidering now that SQL / DataFrame APIs established themselves as "preferred" way to interact with Spark and platforms like Databricks SQL increase collaboration with less-technical users. Meanwhile, the RDD and {{zipWithIndex}} workaround is awkward because it implies some ordering that can't be assumed at the RDD-level but the datasource _can_ assume at the CSV-level. > Support row skipping when reading CSV files > --- > > Key: SPARK-42359 > URL: https://issues.apache.org/jira/browse/SPARK-42359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screenshot 2023-02-06 at 13.23.34.png > > > Spark currently can't read CSV files that contain lines with comments or > annotations above the header and data. Work-arounds include pre-processing > CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these > increase friction for less technical users. > This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a > number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
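For reference, the {{zipWithIndex}} workaround mentioned above looks roughly like the sketch below. The path and the number of preamble rows are made up, and it leans on line order being preserved when reading a single file, which is exactly the ordering assumption that makes the workaround awkward at the RDD level.

{code:title=spark-shell}
import spark.implicits._

val linesToSkip = 9L                                        // hypothetical preamble size
val raw = spark.read.textFile("/path/to/excel_export.csv")  // hypothetical path

// Drop the leading annotation rows, then hand the remaining lines to the CSV parser.
val withoutPreamble = raw.rdd
  .zipWithIndex()
  .filter { case (_, index) => index >= linesToSkip }
  .map { case (line, _) => line }
  .toDS()

val df = spark.read
  .option("header", "true")
  .csv(withoutPreamble)
{code}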
[jira] [Commented] (SPARK-42359) Support row skipping when reading CSV files
[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684683#comment-17684683 ] Willi Raschkowski commented on SPARK-42359: --- In our experience such CSV files tend to be Excel exports where users like to populate rows above the header with descriptions of the data. To give a real-world example: [here's a dataset made available by the UK government (data.gov.uk)|https://www.data.gov.uk/dataset/9003012e-4564-4a6b-b5f0-8765ccb23a03/average-road-fuel-sales-deliveries-and-stock-levels]. The dataset is only available via Excel files that look like this: !Screenshot 2023-02-06 at 13.23.34.png! Exporting from Excel for consumption in Spark results in a CSV that looks like this: {code} cat ~/Downloads/20230202_Average_road_fuel_sales_deliveries_and_stock_levels.csv | head -n 15 | cut -c1-150 "Average road fuel deliveries at sampled filling stations: United Kingdom, from 27 January 2020 [note 1][note 2][note 3]" This worksheet contains one table. Some cells refer to notes which can be found in the notes worksheet.,,, "Freeze panes are turned on. To turn off freeze panes select the 'View' ribbon then 'Freeze Panes' then 'Unfreeze Panes' or use [Alt,W,F]" Source: BEIS,, Released: 02 February 2023 Return to contents Units: Volume in litres,,, Date,Weekday,Fuel Type,North East,North West,Yorkshire and The Humber,"East Midlands","West Midlands",East,London,South East,South West,Northern Ireland,Wales,Scotland,"England [note 3]",United Kingdom,, 27/01/2020,Monday,Diesel," 10,583 "," 9,422 "," 11,687 "," 11,205 "," 11,353 "," 10,284 "," 7,501 "," 10,023 "," 9,535 "," 8,511 "," 9,961 "," 9,600 " 28/01/2020,Tuesday,Diesel," 11,643 "," 10,440 "," 13,172 "," 11,885 "," 12,943 "," 12,255 "," 7,310 "," 10,106 "," 11,144 "," 7,740 "," 10,306 "," 10, 29/01/2020,Wednesday,Diesel," 10,839 "," 10,021 "," 11,417 "," 12,195 "," 11,370 "," 12,542 "," 8,102 "," 11,235 "," 10,840 "," 6,943 "," 11,532 "," 9 30/01/2020,Thursday,Diesel," 8,808 "," 10,673 "," 11,871 "," 13,469 "," 12,727 "," 12,445 "," 7,708 "," 11,044 "," 9,741 "," 7,456 "," 10,647 "," 10,2 {code} > Support row skipping when reading CSV files > --- > > Key: SPARK-42359 > URL: https://issues.apache.org/jira/browse/SPARK-42359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screenshot 2023-02-06 at 13.23.34.png > > > Spark currently can't read CSV files that contain lines with comments or > annotations above the header and data. Work-arounds include pre-processing > CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these > increase friction for less technical users. > This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a > number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42359) Support row skipping when reading CSV files
[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-42359: -- Attachment: Screenshot 2023-02-06 at 13.23.34.png > Support row skipping when reading CSV files > --- > > Key: SPARK-42359 > URL: https://issues.apache.org/jira/browse/SPARK-42359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screenshot 2023-02-06 at 13.23.34.png > > > Spark currently can't read CSV files that contain lines with comments or > annotations above the header and data. Work-arounds include pre-processing > CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these > increase friction for less technical users. > This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a > number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42359) Support row skipping when reading CSV files
Willi Raschkowski created SPARK-42359: - Summary: Support row skipping when reading CSV files Key: SPARK-42359 URL: https://issues.apache.org/jira/browse/SPARK-42359 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: Willi Raschkowski Spark currently can't read CSV files that contain lines with comments or annotations above the header and data. Work-arounds include pre-processing CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these increase friction for less technical users. This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33760) Extend Dynamic Partition Pruning Support to DataSources
[ https://issues.apache.org/jira/browse/SPARK-33760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584882#comment-17584882 ] Willi Raschkowski commented on SPARK-33760: --- Is this related to SPARK-35779? > Extend Dynamic Partition Pruning Support to DataSources > --- > > Key: SPARK-33760 > URL: https://issues.apache.org/jira/browse/SPARK-33760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Anoop Johnson >Priority: Major > > The implementation of Dynamic Partition Pruning (DPP) in Spark is > [specific|https://github.com/apache/spark/blob/fb2e3af4b5d92398d57e61b766466cc7efd9d7cb/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L59-L64] > to HadoopFSRelation. As a result, DPP is not triggered for queries that use > data sources. > The DataSource v2 readers can expose the partition metadata. Can we use this > metadata and extend DPP to work on data sources as well? > Would appreciate thoughts or corner cases we need to handle. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570048#comment-17570048 ] Willi Raschkowski commented on SPARK-39659: --- [~hyukjin.kwon], we talked about this at the conference. Basically Spark somewhat supports running Python from environments. But it doesn't run "conda activate" and the equivalent for other package managers. We could approximate "conda activate" by updating the PATH. Curious if you think this is a general problem. Or if you think Spark shouldn't be solving it. Cluster owners should solve it by setting PATH themselves. The reason we don't do that is that we want the {{PATH}} elements to be absolute locations (to support e.g. {{Popen}} following a {{chdir}}) and for YARN we don't know the working directory location in advance. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > 
self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/app
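A minimal sketch of the PATH-based approximation described in the comment above, assuming the archive was shipped as {{environment.tar.gz#environment}}; the unpack location is a placeholder and none of this is existing Spark code:

{code:scala}
// Roughly what "conda activate" does to PATH, and what Spark would have to
// approximate for an environment unpacked from spark.archives. The unpack
// location below is a placeholder -- on YARN it is only known once the
// container starts, which is why a static cluster-side PATH doesn't work.
import java.io.File

val envRoot = "/placeholder/container-dir/environment"
val activatedPath = envRoot + "/bin" + File.pathSeparator + sys.env.getOrElse("PATH", "")
// Because the entry has to be absolute (a subprocess may chdir after Popen),
// Spark would need to compute it at launch time rather than rely on cluster config.
{code}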
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561661#comment-17561661 ] Willi Raschkowski commented on SPARK-39659: --- Another way could be to add a config like {{spark.(driver|executor).extraPathDirs}} which the PythonRunner (and RRunner) apply to their {{ProcessBuilder}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido > > Searched for executable 'kaleido' on the following system PATH: > /usr/local/sbin > /usr/local/bin > /usr/sbin > /usr/bin > /sbin > /bin > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issue
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:16 AM: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. So we can add the working directory to {{PATH}}. Still, that's inconvenient to do because you need to modify infrastructure, i.e. set {{PATH}} on YARN nodes or in the K8s image. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. So we can add the working directory to {{PATH}}. Still, that's inconvenient to do: 1) Using relative paths in {{PATH}} is fickle. 2) You need to modify infrastructure, i.e. set {{PATH}} on YARN nodes or in the K8s image. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:15 AM: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. So we can add the working directory to {{PATH}}. Still, that's inconvenient to do: 1) Using relative paths in {{PATH}} is fickle. 2) You need to modify infrastructure, i.e. set {{PATH}} on YARN nodes or in the K8s image. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:08 AM: --- Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. ~But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{{}SparkFiles{}}}.~ Actually, this isn't right. See comment below. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", 
> line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido execu
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:06 AM: --- Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. ~But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{{}SparkFiles{}}}.~ Actually, this isn't right. See comment below. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_165
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561657#comment-17561657 ] Willi Raschkowski commented on SPARK-39659: --- Anyway, wanted to get your thoughts on this. If you think that adding the parent folder of the Python executable is the right move (i.e. for {{./environment/bin/python}} we do {{$PATH:./environment/bin}}, I can put up a PR. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido > > Searched for executable 'kaleido' on the following system PATH: > /usr/local/sbin > /usr/local/bin > /usr/sbin > /usr/bin > /sbin > /bin > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:56 AM: --- The way we solve this in our fork is by doing something like {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} with {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} inside the driver- and executor-side PythonRunners. was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} inside the driver- and executor-side PythonRunners. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base
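For illustration, the {{appendDirToEnvironmentPath}} helper from the comment above can be exercised on its own; the {{ProcessBuilder}} command line here is made up for the example and is not how Spark launches its Python workers:

{code:scala}
// Standalone illustration of the appendDirToEnvironmentPath helper quoted above.
import java.io.File
import java.nio.file.{Path, Paths}

def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = {
  processBuilder.environment().compute("PATH", (_, oldPath) =>
    Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString))
}

val pythonExec = "./environment/bin/python"
val builder    = new ProcessBuilder(pythonExec, "script.py")   // made-up command for the example
appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder)
// builder.environment().get("PATH") now ends with .../environment/bin, so executables
// installed next to the interpreter (like kaleido) become discoverable by the subprocess.
{code}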
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski commented on SPARK-39659: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido > > Searched for executable 'kaleido' on the following system PATH: > /usr/local/sbin > /usr/local/bin > /usr/sbin > /usr/bin > /sbin > /bin > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) -
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:55 AM: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-p
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:54 AM: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} inside the driver- and executor-side PythonRunners. was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_arg
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:53 AM: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code:scala} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:53 AM: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code:scala} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoo
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski commented on SPARK-39659: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in 
_perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /
[jira] [Created] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
Willi Raschkowski created SPARK-39659: - Summary: Add environment bin folder to R/Python subprocess PATH Key: SPARK-39659 URL: https://issues.apache.org/jira/browse/SPARK-39659 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Willi Raschkowski Some Python packages rely on non-Python executables which are usually made available on the {{PATH}} through something like {{{}conda activate{}}}. When using Spark with conda-pack environments added via {{{}spark.archives{}}}, Python packages aren't able to find conda-installed executables because Spark doesn't update {{{}PATH{}}}. E.g. {code:java|title=test.py} # This only works if kaleido-python can find the conda-installed executable fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", color="species") fig.write_image("figure.png", engine="kaleido") {code} and {code:java} ./bin/spark-submit --master yarn --deploy-mode cluster --archives environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py {code} will throw {code:java} Traceback (most recent call last): File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", line 7, in fig.write_image("figure.png", engine="kaleido") File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", line 3829, in write_image return pio.write_image(self, *args, **kwargs) File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 267, in write_image img_data = to_image( File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 144, in to_image img_bytes = scope.transform( File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", line 153, in transform response = self._perform_transform( File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 293, in _perform_transform self._ensure_kaleido() File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 176, in _ensure_kaleido proc_args = self._build_proc_args() File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 123, in _build_proc_args proc_args = [self.executable_path(), self.scope_name] File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 99, in executable_path raise ValueError( ValueError: The kaleido executable is required by the kaleido Python library, but it was not included in the Python 
package and it could not be found on the system PATH. Searched for included kaleido executable at: /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido Searched for executable 'kaleido' on the following system PATH: /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
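A user-level workaround for SPARK-39659, until Spark itself appends the environment's bin directory, is to patch PATH from inside the job before importing packages that shell out to bundled executables. This is only a sketch under the assumptions of the kaleido repro above (conda-pack archive unpacked as ./environment, PYSPARK_PYTHON pointing into its bin directory); it is not part of the ticket:
{code:python}
import os
import sys

# sys.executable is ./environment/bin/python when PYSPARK_PYTHON points into
# the unpacked conda-pack archive, so its parent is the environment's bin dir.
env_bin = os.path.dirname(sys.executable)
os.environ["PATH"] = env_bin + os.pathsep + os.environ.get("PATH", "")

# Import after PATH is patched; kaleido looks up its executable lazily and
# falls back to a PATH search (see the error above).
import plotly.express as px

fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", color="species")
fig.write_image("figure.png", engine="kaleido")
{code}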
[jira] [Updated] (SPARK-39107) Silent change in regexp_replace's handling of empty strings
[ https://issues.apache.org/jira/browse/SPARK-39107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39107: -- Labels: correctness (was: ) > Silent change in regexp_replace's handling of empty strings > --- > > Key: SPARK-39107 > URL: https://issues.apache.org/jira/browse/SPARK-39107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > > Hi, we just upgraded from 3.0.2 to 3.1.2 and noticed a silent behavior change > that a) seems incorrect, and b) is undocumented in the [migration > guide|https://spark.apache.org/docs/latest/sql-migration-guide.html]: > {code:title=3.0.2} > scala> val df = spark.sql("SELECT '' AS col") > df: org.apache.spark.sql.DataFrame = [col: string] > scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", > "")).show > +---++ > |col|replaced| > +---++ > | | | > +---++ > {code} > {code:title=3.1.2} > scala> val df = spark.sql("SELECT '' AS col") > df: org.apache.spark.sql.DataFrame = [col: string] > scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", > "")).show > +---++ > |col|replaced| > +---++ > | || > +---++ > {code} > Note, the regular expression {{^$}} should match the empty string, but > doesn't in version 3.1. E.g. this is the Java behavior: > {code} > scala> "".replaceAll("^$", ""); > res1: String = > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39107) Silent change in regexp_replace's handling of empty strings
Willi Raschkowski created SPARK-39107: - Summary: Silent change in regexp_replace's handling of empty strings Key: SPARK-39107 URL: https://issues.apache.org/jira/browse/SPARK-39107 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Reporter: Willi Raschkowski Hi, we just upgraded from 3.0.2 to 3.1.2 and noticed a silent behavior change that a) seems incorrect, and b) is undocumented in the [migration guide|https://spark.apache.org/docs/latest/sql-migration-guide.html]: {code:title=3.0.2} scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "")).show +---++ |col|replaced| +---++ | | | +---++ {code} {code:title=3.1.2} scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "")).show +---++ |col|replaced| +---++ | || +---++ {code} Note, the regular expression {{^$}} should match the empty string, but doesn't in version 3.1. E.g. this is the Java behavior: {code} scala> "".replaceAll("^$", ""); res1: String = {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
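For code that has to behave the same on both sides of this change, one possible mitigation (a sketch, not taken from the ticket; the replacement string "EMPTY" is illustrative) is to special-case empty input instead of relying on a pattern that matches the empty string:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT '' AS col")

# On 3.0.x the pattern ^$ matches the empty string and the replacement is
# applied; on 3.1.2 the empty input comes back unchanged, as shown above.
df.withColumn("replaced", F.regexp_replace(F.col("col"), "^$", "EMPTY")).show()

# Version-independent alternative: handle the empty string explicitly.
df.withColumn(
    "replaced",
    F.when(F.col("col") == "", F.lit("EMPTY"))
     .otherwise(F.regexp_replace(F.col("col"), "^$", "EMPTY")),
).show()
{code}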
[jira] [Comment Edited] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529381#comment-17529381 ] Willi Raschkowski edited comment on SPARK-39044 at 4/28/22 11:29 AM: - [~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained reproducer. This reliably fails in prod. But using that same TypedImperativeAggregate with {{observe()}} in local tests works fine. If you have ideas on what to try, I will. (Also happy to share the aggregate, but from the stacktrace I understood the implementation isn't relevant - it's the {{AggregatingAccumulator}} buffer that is {{null}}. Anyway, I attached [^aggregate.scala].) I understand if you close this ticket because you cannot root-cause without a repro. was (Author: raschkowski): [~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained reproducer. This reliably fails in prod. But using that same TypedImperativeAggregate with {{observe()}} in local tests works fine. If you have ideas on what to try, I will. (Also happy to share the aggregate, but from the stacktrace I understood the implementation isn't relevant - it's the {{AggregatingAccumulator}} buffer that is {{{}null{}}}.) I understand if you close this ticket because you cannot root-cause without a repro. > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: aggregate.scala > > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... 
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(D
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Attachment: aggregate.scala > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: aggregate.scala > > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > 
java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) > at > scala.runtime.java8
[jira] [Commented] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529381#comment-17529381 ] Willi Raschkowski commented on SPARK-39044: --- [~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained reproducer. This reliably fails in prod. But using that same TypedImperativeAggregate with {{observe()}} in local tests works fine. If you have ideas on what to try, I will. (Also happy to share the aggregate, but from the stacktrace I understood the implementation isn't relevant - it's the {{AggregatingAccumulator}} buffer that is {{{}null{}}}.) I understand if you close this ticket because you cannot root-cause without a repro. > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > 
org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:94
[jira] [Commented] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528751#comment-17528751 ] Willi Raschkowski commented on SPARK-39044: --- This worked on Spark 3.0. [~beliefer], given we're hitting this in {{withBufferSerialized}}, I think this might be related to SPARK-37203. > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Description: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) ... Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at 
scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 11 more {code} was: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(Ta
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Description: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ... 1 more Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 
11 more {code} was: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get this error below: {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.ru
[jira] [Created] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
Willi Raschkowski created SPARK-39044: - Summary: AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException Key: SPARK-39044 URL: https://issues.apache.org/jira/browse/SPARK-39044 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Willi Raschkowski We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get this error below: {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ... 1 more Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 
11 more {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
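For context on how the accumulator gets exercised: observed metrics are attached with {{observe()}}. A minimal PySpark sketch using only built-in aggregate expressions follows; note the {{Observation}} helper is only available in newer PySpark releases, and the custom Scala TypedImperativeAggregate from this ticket would instead be registered on the JVM side, so this sketch does not reproduce the NPE:
{code:python}
from pyspark.sql import SparkSession, Observation
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

obs = Observation("metrics")
observed = spark.range(100).observe(
    obs,
    F.count(F.lit(1)).alias("rows"),  # built-in aggregates; the NPE above
    F.max("id").alias("max_id"),      # involves a custom TypedImperativeAggregate
)                                     # whose buffer was never initialized

# Metrics are collected once an action runs on the observed plan.
observed.write.mode("overwrite").parquet("/tmp/observe-demo")
print(obs.get)
{code}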
[jira] [Updated] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36768: -- Issue Type: Bug (was: Task) > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
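A workaround that usually sidesteps this resolution failure is to alias the file-format relation, so the column reference only needs a single-part qualifier. A sketch (not from the ticket):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(3).toDF("col").write.mode("overwrite").parquet("testdata")

# Qualifying the column with the full parquet.testdata reference fails to
# resolve (see above); qualifying it through a relation alias resolves fine.
spark.sql("SELECT t.col FROM parquet.`testdata` AS t").show()
{code}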
[jira] [Updated] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-38166: -- Description: We're seeing duplicates after running the following {code:java} def compute_shipments(shipments): shipments = shipments.dropDuplicates(["ship_trck_num"]) shipments = shipments.repartition(4) return shipments {code} and observing lost executors (OOMs) and task retries in the repartition stage. We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs - maybe you have ideas. was: We're seeing duplicates after running the following {code} def compute_shipments(shipments): shipments = shipments.dropDuplicates(["ship_trck_num"]) shipments = shipments.repartition(4) return shipments {code} and observing lost executors (OOMs) and task retries in the repartition stage. We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas. > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code:java} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs - maybe you have > ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489579#comment-17489579 ] Willi Raschkowski commented on SPARK-38166: --- Linking SPARK-23207 (which is closed but looks very related) and SPARK-25342 (which is open but I understand would only explain this if we were operating on RDDs). > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs and the > notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489523#comment-17489523 ] Willi Raschkowski commented on SPARK-38166: --- Attaching driver logs: [^driver.log] Notable lines are probably: {code:java} ... INFO [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded). INFO [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) INFO [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead. at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663) ... Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead. at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) ... INFO [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) due to fetch failure INFO [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting failed stages ... {code} > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs and the > notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-38166: -- Attachment: driver.log > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs and the > notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
Willi Raschkowski created SPARK-38166: - Summary: Duplicates after task failure in dropDuplicates and repartition Key: SPARK-38166 URL: https://issues.apache.org/jira/browse/SPARK-38166 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2 Environment: Cluster runs on K8s. AQE is enabled. Reporter: Willi Raschkowski We're seeing duplicates after running the following {code} def compute_shipments(shipments): shipments = shipments.dropDuplicates(["ship_trck_num"]) shipments = shipments.repartition(4) return shipments {code} and observing lost executors (OOMs) and task retries in the repartition stage. We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
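One mitigation sometimes suggested for duplicates that appear after task retries with a round-robin {{repartition()}} is to make the shuffle deterministic, e.g. by repartitioning on the deduplication key. A sketch of the pipeline above with that change (a workaround idea, not a confirmed fix for this ticket):
{code:python}
def compute_shipments(shipments):
    shipments = shipments.dropDuplicates(["ship_trck_num"])
    # repartition(4) alone uses round-robin partitioning, so a retried upstream
    # task can send rows to different partitions than the first attempt did;
    # hash-partitioning on the key keeps the row-to-partition mapping stable.
    shipments = shipments.repartition(4, "ship_trck_num")
    return shipments
{code}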
[jira] [Commented] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449619#comment-17449619 ] Willi Raschkowski commented on SPARK-37465: --- I'll give the pandas bump a shot. > PySpark tests failing on Pandas 0.23 > > > Key: SPARK-37465 > URL: https://issues.apache.org/jira/browse/SPARK-37465 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Willi Raschkowski >Priority: Major > > I was running Spark tests with Pandas {{0.23.4}} and got the error below. The > minimum Pandas version is currently {{0.23.2}} > [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. > Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix > (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222] > in Pandas. > {code:java} > $ python/run-tests --testnames > 'pyspark.pandas.tests.data_type_ops.test_boolean_ops > BooleanOpsTest.test_floordiv' > ... > == > ERROR [5.785s]: test_floordiv > (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) > -- > Traceback (most recent call last): > File > "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", > line 128, in test_floordiv > self.assert_eq(b_pser // b_pser.astype(int), b_psser // > b_psser.astype(int)) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", > line 1069, in wrapper > result = safe_na_op(lvalues, rvalues) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", > line 1033, in safe_na_op > return na_op(lvalues, rvalues) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", > line 1027, in na_op > result = missing.fill_zeros(result, x, y, op_name, fill_zeros) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", > line 641, in fill_zeros > signs = np.sign(y if name.startswith(('r', '__r')) else x) > TypeError: ufunc 'sign' did not contain a loop with signature matching types > dtype('bool') dtype('bool') > {code} > These are my relevant package versions: > {code:java} > $ conda list | grep -e numpy -e pyarrow -e pandas -e python > # packages in environment at /home/circleci/miniconda/envs/python3: > numpy 1.16.6 py36h0a8e133_3 > numpy-base1.16.6 py36h41b4c56_3 > pandas0.23.4 py36h04863e7_0 > pyarrow 1.0.1 py36h6200943_36_cpuconda-forge > python3.6.12 hcff3b4d_2anaconda > python-dateutil 2.8.1 py_0anaconda > python_abi3.6 1_cp36mconda-forg > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
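The underlying incompatibility can be reproduced without Spark. A pandas-only sketch of the failing operation from the traceback above (series contents are illustrative):
{code:python}
import pandas as pd

b = pd.Series([True, False, True])

# On pandas 0.23.x this raises:
#   TypeError: ufunc 'sign' did not contain a loop with signature matching
#   types dtype('bool') dtype('bool')
# On pandas >= 0.24, which contains the fix linked above, it completes.
print(b // b.astype(int))
{code}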
[jira] [Comment Edited] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449584#comment-17449584 ] Willi Raschkowski edited comment on SPARK-37465 at 11/26/21, 2:13 PM: -- I also noticed that {{CategoricalOpsTest}} fails on pandas 0.25.3 (latest 0.x) and works on 1.x: {code:java} $ conda list | grep pandas pandas0.25.3 py36he6710b0_0 $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest' ... Running tests... -- /home/circleci/project/python/pyspark/context.py:238: FutureWarning: Python 3.6 support is deprecated in Spark 3.2. FutureWarning test_abs (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (2.353s) test_add (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (1.382s) test_and (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.265s) ok (6.569s) alOpsTest) ... test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.514s) test_floordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.910s) test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.143s) test_ge (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.795s) test_gt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.891s) test_invert (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_isnull (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.097s) test_le (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.863s) test_lt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.844s) test_mod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.897s) test_mul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.860s) test_ne (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.405s) test_neg (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_or (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.160s) test_pow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.821s) test_radd (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.081s) test_rand (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.100s) test_rfloordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.083s) test_rmod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.050s) test_rmul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.079s) test_ror (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.095s) test_rpow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rsub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rtruediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... 
ok (0.079s) test_sub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.818s) test_truediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.832s) == FAIL [1.611s]: test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 122, in assertPandasEqual **kwargs File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 1248, in assert_series_equal assert_attr_equal('name', left, right, obj=obj) File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 941, in assert_attr_equal raise_assert_detail(obj, msg, left_attr, right_attr) AssertionError: Series are different Attribute "name" are different [left]: that_numeric_cat [right]: None The above e
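The {{assert_series_equal}} failure above is purely about the {{name}} attribute, so a small pandas-only probe (a guess, not verified here) could confirm whether comparisons on a named categorical Series preserve the result's name differently on 0.25.x vs. 1.x:
{code:python}
# Probe sketch: the series name is taken from the assertion message above; whether the
# name survives the comparison is exactly what differs between the two sides of assert_eq.
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3]), name="that_numeric_cat")
print(pd.__version__, (s == s).name)
{code}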
[jira] [Commented] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449584#comment-17449584 ] Willi Raschkowski commented on SPARK-37465: --- I also noticed another that {{CategoricalOpsTest}} fails on pandas 0.25.3 (latest 0.x) and works on 1.x: {code:java} $ conda list | grep pandas pandas0.25.3 py36he6710b0_0 $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest' ... Running tests... -- /home/circleci/project/python/pyspark/context.py:238: FutureWarning: Python 3.6 support is deprecated in Spark 3.2. FutureWarning test_abs (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (2.353s) test_add (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (1.382s) test_and (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.265s) ok (6.569s) alOpsTest) ... test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.514s) test_floordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.910s) test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.143s) test_ge (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.795s) test_gt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.891s) test_invert (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_isnull (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.097s) test_le (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.863s) test_lt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.844s) test_mod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.897s) test_mul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.860s) test_ne (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.405s) test_neg (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_or (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.160s) test_pow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.821s) test_radd (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.081s) test_rand (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.100s) test_rfloordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.083s) test_rmod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.050s) test_rmul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.079s) test_ror (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.095s) test_rpow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rsub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rtruediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... 
ok (0.079s) test_sub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.818s) test_truediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.832s) == FAIL [1.611s]: test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 122, in assertPandasEqual **kwargs File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 1248, in assert_series_equal assert_attr_equal('name', left, right, obj=obj) File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 941, in assert_attr_equal raise_assert_detail(obj, msg, left_attr, right_attr) AssertionError: Series are different Attribute "name" are different [left]: that_numeric_cat [right]: None The above exception was the direct cause of the followin
[jira] [Updated] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-37465: -- Description: I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... == ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anaconda python-dateutil 2.8.1 py_0anaconda python_abi3.6 1_cp36mconda-forg {code} was: I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... 
== ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anac
[jira] [Created] (SPARK-37465) PySpark tests failing on Pandas 0.23
Willi Raschkowski created SPARK-37465: - Summary: PySpark tests failing on Pandas 0.23 Key: SPARK-37465 URL: https://issues.apache.org/jira/browse/SPARK-37465 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Willi Raschkowski I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... == ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anaconda python-dateutil 2.8.1 py_0anaconda python_abi3.6 1_cp36mconda-forg {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415631#comment-17415631 ] Willi Raschkowski commented on SPARK-36768: --- In case you wonder why we care or why we can't just re-write our query with an alias: Those queries without aliases are generated and are meant to be compatible with both Spark SQL and another SQL database (where they work). > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415630#comment-17415630 ] Willi Raschkowski commented on SPARK-36768: --- In the debugger I see [on this line|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L227] {{collectMatches}} doesn't produce any matches because {{qualified3Part}} is an empty map. And it seems to be an empty map because the {{"col"}} attribute in this {{AttributeSeq}} has empty qualifiers. On the other hand, if you do {code:sql} SELECT t.col FROM parquet.testdata t {code} the {{"col"}} attribute in the {{AttributeSeq}} has {{"t"}} as a qualifier. And thus we get matches [on this line|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L253] when filtering for the {{"t"}} qualifier. Naively, that makes me wonder why in the {{"parquet.testdata.col"}} case {{"parquet.testdata"}} is not part of the {{"col"}} attribute's qualifier, but when we alias the table the alias is included as qualifier. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
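For anyone hitting this before a fix lands, the aliasing observation above doubles as a workaround. A sketch using the same table reference and {{spark}} session as the repro; whether aliasing is acceptable depends on how your queries are generated (see the other comment on this ticket):
{code:python}
# Aliasing gives the "col" attribute a "t" qualifier, so resolution succeeds.
spark.sql("SELECT t.col FROM parquet.testdata t").show()
# The unaliased, fully qualified form is what currently fails with AnalysisException:
# spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show()
{code}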
[jira] [Commented] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415621#comment-17415621 ] Willi Raschkowski commented on SPARK-36768: --- This also reproduces on master at time of writing. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36768: -- Description: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} The expected behavior is that {{parquet.testdata.col}} is recognized as referring to attribute {{col}} in {{parquet.testdata}} (you'd expect {{AttributeSeq.resolve}} matches [this case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). was: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} The expected behavior is that {{parquet.testdata.col}} is recognized as referring to attribute {{col}} in {{parquet.testdata}}. This also reproduces on master at time of writing. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36768: -- Description: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} The expected behavior is that {{parquet.testdata.col}} is recognized as referring to attribute {{col}} in {{parquet.testdata}}. This also reproduces on master at time of writing. was: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} This also reproduces on master at time of writing. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}}. > This also reproduces on master at time of writing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36768) Cannot resolve attribute with table reference
Willi Raschkowski created SPARK-36768: - Summary: Cannot resolve attribute with table reference Key: SPARK-36768 URL: https://issues.apache.org/jira/browse/SPARK-36768 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.1.2, 3.0.3, 2.4.7 Reporter: Willi Raschkowski Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} This also reproduces on master at time of writing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398610#comment-17398610 ] Willi Raschkowski commented on SPARK-35324: --- [~mgekk], apologies for the direct ping. Do you know who could look at this? Just hoping to get more jobs upgraded to Spark 3. To summarize the issue: As you know some datetime reads/writes/parses in Spark 3 rely on additional configs, e.g. reading pre-1900 timestamps. It seems that even if you set those configs they don't get propagated to RDDs and jobs fail as if the config wasn't set. > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collect
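For completeness, the same repro in PySpark (a sketch; the {{test.csv}} path, options, and schema are taken from the ticket, and the {{spark}} session object is assumed):
{code:python}
# DataFrame-level actions respect the session conf; RDD-level actions on the same
# DataFrame do not and raise SparkUpgradeException, as described in this ticket.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = (spark.read
      .option("header", "true")
      .option("dateFormat", "MM/dd/yy")
      .schema("date date")
      .csv("test.csv"))
df.count()      # works, the date is parsed with the legacy parser
df.rdd.count()  # fails as if the config were unset
{code}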
[jira] [Comment Edited] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski edited comment on SPARK-35324 at 8/13/21, 11:44 AM: -- Here's an example of that: {code:title=Missing config error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} where we're already setting {{...datetimeRebaseModeInWrite: legacy}}. I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. was (Author: raschkowski): Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (
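To make the comment above concrete, this is the kind of session-level setting the error message asks for and that is already in place on our side (a sketch; the actual job code isn't shown in this ticket). The point of this issue is that the RDD-backed path still behaves as if it were unset:
{code:python}
# Config name taken verbatim from the SparkUpgradeException above; setting it on the
# session is respected by DataFrame writes but, per this ticket, not once RDDs are involved.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
{code}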
[jira] [Comment Edited] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski edited comment on SPARK-35324 at 8/13/21, 11:43 AM: -- Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} where we're already setting {{...datetimeRebaseModeInWrite: legacy}}. I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. was (Author: raschkowski): Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, r
[jira] [Comment Edited] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski edited comment on SPARK-35324 at 8/13/21, 11:43 AM: -- Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} where we're already setting {{...rebaseModeInWrite: legacy}}. I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. was (Author: raschkowski): Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted,
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski commented on SPARK-35324: --- Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Typ
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387298#comment-17387298 ] Willi Raschkowski commented on SPARK-36034: --- [~maxgekk], much appreciated, thanks! > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Assignee: Max Gekk >Priority: Blocker > Labels: correctness > Fix For: 3.2.0, 3.1.3, 3.0.4, 3.3.0 > > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") > >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = > >>> '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") > >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#154] Batched: true, DataFilters: > [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376795#comment-17376795 ] Willi Raschkowski edited comment on SPARK-36034 at 7/7/21, 7:00 PM: To show you the metadata of the Parquet files: {code:java|title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:java|title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. Maybe we need an "inverse rebase" before push-down? 
was (Author: raschkowski): To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. Maybe we need an "inverse rebase" before pushing down? > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https:
[jira] [Comment Edited] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376788#comment-17376788 ] Willi Raschkowski edited comment on SPARK-36034 at 7/7/21, 6:58 PM: You can probably guess but just for completeness, we're also seeing this in Spark 2-written files: {code:java} >>> spark.read.parquet("date_written_by_spark2").selectExpr("date", "date = >>> '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark2").filter("date = >>> '0001-01-01'").show() ++ |date| ++ ++ {code} was (Author: raschkowski): You can probably guess but just for completeness, we're also seeing this in Spark 2-written files: {code} >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").filter("date >>> = '0001-01-01'").show() ++ |date| ++ ++ {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") > >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = > >>> '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") > >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#154] Batched: true, DataFilters: > [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36034: -- Description: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = >>> '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} was: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. 
This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = >>> '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3
[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36034: -- Description: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = >>> '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} was: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. 
This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376796#comment-17376796 ] Willi Raschkowski commented on SPARK-36034: --- [~maxgekk], I think you might be most familiar with these code paths? > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#122] Batched: true, DataFilters: > [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376795#comment-17376795 ] Willi Raschkowski edited comment on SPARK-36034 at 7/7/21, 6:54 PM: To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. Maybe we need an "inverse rebase" before pushing down? 
was (Author: raschkowski): To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > P
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376795#comment-17376795 ] Willi Raschkowski commented on SPARK-36034: --- To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. 
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#122] Batched: tru
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376788#comment-17376788 ] Willi Raschkowski commented on SPARK-36034: --- You can probably guess but just for completeness, we're also seeing this in Spark 2-written files: {code} >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").filter("date >>> = '0001-01-01'").show() ++ |date| ++ ++ {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#122] Batched: true, DataFilters: > [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
Willi Raschkowski created SPARK-36034: - Summary: Incorrect datetime filter when reading Parquet files written in legacy mode Key: SPARK-36034 URL: https://issues.apache.org/jira/browse/SPARK-36034 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.1.2 Reporter: Willi Raschkowski We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
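Not discussed in the ticket, but for anyone stuck on an affected version: a possible mitigation (an assumption on my part, not verified here) is to disable Parquet filter pushdown so the predicate is evaluated by Spark after the legacy dates have been rebased on read, rather than by parquet-mr against the raw statistics:

{code:java}
// Workaround sketch (assumption, not verified in this ticket): with pushdown
// disabled, parquet-mr no longer prunes row groups using the un-rebased
// literal; Spark applies the filter after rebasing the dates on read.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()

// Re-enable pushdown afterwards; it matters for scan performance.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
{code}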
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Issue Type: Bug (was: Task) > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark queries fail on unrecognized hints in subqueries. An example to > reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Description: Spark queries fail on unrecognized hints in subqueries. An example to reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} was: Spark queries seem to fail on unrecognized hints in subqueries. An example to reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark queries fail on unrecognized hints in subqueries. An example to > reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Description: Spark queries seem to fail on unrecognized hints in subqueries. An example to reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} was: Spark fails on unrecognized hint in subquery. To reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark queries seem to fail on unrecognized hints in subqueries. An example to > reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Description: Spark fails on unrecognized hint in subquery. To reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} was: Spark fails on unrecognized hint in subquery. To reproduce, try {code:sql} -- This succeeds with warning SELECT /*+ use_hash */ 42; -- This fails SELECT * FROM ( SELECT /*+ use_hash */ 42 ); {code} The first statement gives you {code} 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() 42 {code} while the second statement gives you {code} 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() Error in query: unresolved operator 'Project [*]; 'Project [*] +- SubqueryAlias __auto_generated_subquery_name +- Project [42 AS 42#2] +- OneRowRelation {code} > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark fails on unrecognized hint in subquery. > To reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35673) Spark fails on unrecognized hint in subquery
Willi Raschkowski created SPARK-35673: - Summary: Spark fails on unrecognized hint in subquery Key: SPARK-35673 URL: https://issues.apache.org/jira/browse/SPARK-35673 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.1.2, 3.1.1, 3.0.2 Reporter: Willi Raschkowski Spark fails on unrecognized hint in subquery. To reproduce, try {code:sql} -- This succeeds with warning SELECT /*+ use_hash */ 42; -- This fails SELECT * FROM ( SELECT /*+ use_hash */ 42 ); {code} The first statement gives you {code} 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() 42 {code} while the second statement gives you {code} 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() Error in query: unresolved operator 'Project [*]; 'Project [*] +- SubqueryAlias __auto_generated_subquery_name +- Project [42 AS 42#2] +- OneRowRelation {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351728#comment-17351728 ] Willi Raschkowski commented on SPARK-35324: --- We found the same issue with {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}. It seems that RDDs are generally not respecting {{spark.sql.*}} configs? > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org.apache.spark.rdd.RDD.$an
[jira] [Updated] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35324: -- Summary: Spark SQL configs not respected in RDDs (was: Time parser policy not respected in RDD) > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253) > at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253) >
[jira] [Commented] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339939#comment-17339939 ] Willi Raschkowski commented on SPARK-35324: --- This also reproduces if I launch the shell with {{--conf "spark.sql.legacy.timeParserPolicy=legacy"}}; just to prove that this isn't because I set the config via {{spark.conf.set}}. > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org
[jira] [Commented] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339938#comment-17339938 ] Willi Raschkowski commented on SPARK-35324: --- I think the difference might have to do with the fact that in the RDD case the config isn't in the local properties of the {{TaskContext}}. * Stepping through the debugger, I see that both RDD and Dataset decide on using or not using the legacy date formatter in [{{DateFormatter.getFormatter}}|https://github.com/apache/spark/blob/4fe4b65d9e4017654c93c8f7957ae3edbd270d0b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L161]. * Then in [{{SQLConf.get}}|https://github.com/apache/spark/blob/4fe4b65d9e4017654c93c8f7957ae3edbd270d0b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L172], both cases find a {{TaskContext}} and no {{existingConf}}. So they create a new {{ReadOnlySQLConf}} from the {{TaskContext}} object. * RDD and Dataset code path differ in the local properties they find on the {{TaskContext}} [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/ReadOnlySQLConf.scala#L32]. The Dataset code path has {{spark.sql.legacy.timeParserPolicy}} in the local properties, but the RDD path doesn't. The {{ReadOnlySQLConf}} is created from the local properties, so in the RDD path the resulting config object doesn't have an override for {{spark.sql.legacy.timeParserPolicy}}. Just to show you what I see in the debugger. In both cases we stopped [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/ReadOnlySQLConf.scala#L32]. !Screen Shot 2021-05-06 at 00.35.10.png|width=300! !Screen Shot 2021-05-06 at 00.33.10.png|width=300! > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. 
You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) >
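To make the diagnosis in the comment above observable without a debugger, here is a minimal spark-shell sketch (an editorial sketch assuming the same session and config key as the repro; it is not part of the original report). It checks whether the {{spark.sql.legacy.timeParserPolicy}} override shows up as a {{TaskContext}} local property inside a task started by a plain RDD action:

{code:scala}
import org.apache.spark.TaskContext

// Session-level override, same as in the repro above.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy")

// On the driver, the override is visible through the session conf.
spark.conf.get("spark.sql.legacy.timeParserPolicy") // "legacy"

// Inside a task of a plain RDD job, SQLConf is rebuilt from the TaskContext's
// local properties. If the key was never copied into those properties, the
// lookup below comes back empty and the session override is effectively ignored.
spark.sparkContext
  .parallelize(Seq(1))
  .map { _ =>
    Option(TaskContext.get().getLocalProperty("spark.sql.legacy.timeParserPolicy"))
      .getOrElse("<not set>")
  }
  .collect()
{code}

If the comment's reading of {{ReadOnlySQLConf}} is right, a Dataset action runs inside a SQL execution that copies the session's SQL confs into these local properties, while {{df.rdd.count}} does not, which would match the observed behaviour.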
[jira] [Updated] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35324: -- Attachment: Screen Shot 2021-05-06 at 00.35.10.png Screen Shot 2021-05-06 at 00.33.10.png > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253) > at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
[jira] [Updated] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35324: -- Description: When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in actions on the resulting dataframe. But it's ignored in actions on dataframe's RDD. E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the override and read the config value as {{EXCEPTION}}. For instance: {code:java|title=test.csv} date 2/6/18 {code} {code:java} scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") scala> val df = { | spark.read | .option("header", "true") | .option("dateFormat", "MM/dd/yy") | .schema("date date") | .csv("/Users/wraschkowski/Downloads/test.csv") | } df: org.apache.spark.sql.DataFrame = [date: date] scala> df.show +--+ | date| +--+ |2018-02-06| +--+ scala> df.count res3: Long = 1 scala> df.rdd.count 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string. at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) at org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253) at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/jav
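Assuming the diagnosis above, one possible mitigation (a sketch only; {{SQLExecution.withSQLConfPropagated}} is an internal Spark helper, not a stable public API, and this has not been verified against this ticket) is to run the RDD action inside a block that copies the session's SQL confs into the job's local properties:

{code:scala}
import org.apache.spark.sql.execution.SQLExecution

spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy")

val df = spark.read
  .option("header", "true")
  .option("dateFormat", "MM/dd/yy")
  .schema("date date")
  .csv("/Users/wraschkowski/Downloads/test.csv")

// withSQLConfPropagated sets the session's SQL confs as SparkContext local
// properties for the duration of the body, so tasks of the RDD job can see
// the legacy time parser policy when they rebuild SQLConf from the TaskContext.
val rowCount = SQLExecution.withSQLConfPropagated(spark) {
  df.rdd.count()
}
{code}

This only works around the missing propagation; it doesn't change the underlying behaviour reported here.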
[jira] [Comment Edited] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339933#comment-17339933 ] Willi Raschkowski edited comment on SPARK-35324 at 5/5/21, 11:26 PM: - For what it's worth, I only managed to reproduce with a reader. Creating a dataframe from a {{Seq}} works fine: {code:java} scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).collect res7: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]) scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).rdd.collect res8: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]){code} was (Author: raschkowski): For what it's worth, I only managed to reproduce with a reader: {code:java} scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).collect res7: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]) scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).rdd.collect res8: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]){code} > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.count}} will work as you'd > expect. However, {{df.count.rdd}} will fail because it'll ignore the override > and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at
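On the note above that the issue only reproduces with a reader: a quick, non-conclusive way to compare the two cases is to look at the query plans. My assumption (not verified) is that the in-memory {{Seq}} variant gets folded into a local relation evaluated on the driver, where the session override is visible, whereas the CSV variant parses dates inside the scan's executor tasks:

{code:scala}
import org.apache.spark.sql.functions.to_date
import spark.implicits._

// In-memory case: check whether the projection over the local data was
// collapsed by the optimizer (i.e. evaluated on the driver).
Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).explain(true)

// Reader case: the date column is parsed row by row inside the CSV scan's tasks.
spark.read
  .option("header", "true")
  .option("dateFormat", "MM/dd/yy")
  .schema("date date")
  .csv("/Users/wraschkowski/Downloads/test.csv")
  .explain(true)
{code}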
[jira] [Commented] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339933#comment-17339933 ] Willi Raschkowski commented on SPARK-35324: --- For what it's worth, I only managed to reproduce with a reader: {code:java} scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).collect res7: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]) scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).rdd.collect res8: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]){code} > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.count}} will work as you'd > expect. However, {{df.count.rdd}} will fail because it'll ignore the override > and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterato