[jira] [Resolved] (SPARK-48371) Upgrade to Parquet 1.14
[ https://issues.apache.org/jira/browse/SPARK-48371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski resolved SPARK-48371. --- Resolution: Duplicate > Upgrade to Parquet 1.14 > --- > > Key: SPARK-48371 > URL: https://issues.apache.org/jira/browse/SPARK-48371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Willi Raschkowski >Priority: Major > > There's a bug in Parquet > [(PARQUET-2454)|https://issues.apache.org/jira/browse/PARQUET-2454] where > Parquet in Spark occasionally writes out truncated files with bytes missing > at the end. > The fix was released in Parquet 1.14.0. [See > changelog.|https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-48371) Upgrade to Parquet 1.14
[ https://issues.apache.org/jira/browse/SPARK-48371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848166#comment-17848166 ] Willi Raschkowski commented on SPARK-48371: --- Apologies, I noticed we already have SPARK-48177. > Upgrade to Parquet 1.14 > --- > > Key: SPARK-48371 > URL: https://issues.apache.org/jira/browse/SPARK-48371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.1 >Reporter: Willi Raschkowski >Priority: Major > > There's a bug in Parquet > [(PARQUET-2454)|https://issues.apache.org/jira/browse/PARQUET-2454] where > Parquet in Spark occasionally writes out truncated files with bytes missing > at the end. > The fix was released in Parquet 1.14.0. [See > changelog.|https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48371) Upgrade to Parquet 1.14
Willi Raschkowski created SPARK-48371: - Summary: Upgrade to Parquet 1.14 Key: SPARK-48371 URL: https://issues.apache.org/jira/browse/SPARK-48371 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.1 Reporter: Willi Raschkowski There's a bug in Parquet [(PARQUET-2454)|https://issues.apache.org/jira/browse/PARQUET-2454] where Parquet in Spark occasionally writes out truncated files with bytes missing at the end. The fix was released in Parquet 1.14.0. [See changelog.|https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
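To check which parquet-mr version a given Spark build actually bundles (and so whether PARQUET-2454 applies), a quick spark-shell probe along these lines can help. This is only a hedged convenience sketch; {{getImplementationVersion}} may come back null if the jar manifest omits the version.

{code:title=spark-shell}
// Report the parquet-mr version on the classpath; prints "unknown" if the
// jar manifest carries no Implementation-Version entry.
val parquetVersion = Option(
  classOf[org.apache.parquet.hadoop.ParquetFileWriter].getPackage.getImplementationVersion
).getOrElse("unknown")

println(s"parquet-mr version on classpath: $parquetVersion")
{code}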
[jira] [Commented] (SPARK-47307) Spark 3.3 produces invalid base64
[ https://issues.apache.org/jira/browse/SPARK-47307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824115#comment-17824115 ] Willi Raschkowski commented on SPARK-47307: --- The behavior change is as follows: * Spark 3.2, [here|https://github.com/apache/spark/blob/e428fe902bb1f12cea973de7fe4b885ae69fd6ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2206], was using Apache's encoder like this: {{{}CommonsBase64.encodeBase64(bytes.asInstanceOf[Array[Byte]]){}}}. * That {{encodeBase64}} call does _not_ chunk [its output|https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64(byte%5B%5D,boolean,boolean,int)]. * Falsely assuming that Apache's encoder would follow the RC2045 / MIME spec, Spark 3.3 started using [Java's MIME encoder|https://github.com/apache/spark/blob/f74867bddfbcdd4d08076db36851e88b15e66556/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2431]. The MIME encoder [follows the RFC2045 spec and _does chunk_|https://datatracker.ietf.org/doc/html/rfc2045#section-6.8:~:text=76%0A%20%20%20%20%20%20%20%20%20%20characters%20long.]. * That chunking is what introduced those {{\r\n}} separators. > Spark 3.3 produces invalid base64 > - > > Key: SPARK-47307 > URL: https://issues.apache.org/jira/browse/SPARK-47307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Blocker > Labels: correctness > > SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} > (which is fine but shouldn't happen between minor version). > {code:title=Spark 3.2} > >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] > 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYQ==' > {code} > Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). > {code:title=Spark 3.3} > >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] > 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh\r\nYQ==' > {code} > The former decodes fine with the {{base64}} on my machine but the latter does > not: > {code} > $ pbpaste | base64 --decode > aa% > $ pbpaste | base64 --decode > base64: stdin: (null): error decoding base64 input stream > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
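The chunking difference is easy to reproduce outside Spark. Below is a minimal spark-shell sketch, assuming commons-codec is on the classpath (Spark bundles it): Apache's {{encodeBase64}} returns one unbroken string, while Java's MIME encoder inserts {{\r\n}} every 76 characters per RFC 2045.

{code:title=spark-shell}
import java.nio.charset.StandardCharsets
import java.util.{Base64 => JavaBase64}
import org.apache.commons.codec.binary.{Base64 => CommonsBase64}

val bytes = ("a" * 58).getBytes(StandardCharsets.UTF_8)

// Spark 3.2 path: Apache Commons encodeBase64, no chunking.
val unchunked = new String(CommonsBase64.encodeBase64(bytes), StandardCharsets.UTF_8)

// Spark 3.3 path: java.util.Base64 MIME encoder, chunked at 76 chars with \r\n.
val chunked = JavaBase64.getMimeEncoder.encodeToString(bytes)

unchunked.contains("\r\n")  // false
chunked.contains("\r\n")    // true
{code}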
[jira] [Updated] (SPARK-47307) Spark 3.3 breaks base64
[ https://issues.apache.org/jira/browse/SPARK-47307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-47307: -- Description: SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} (which is fine but shouldn't happen between minor version). {code:title=Spark 3.2} >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYQ==' {code} Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). {code:title=Spark 3.3} >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0] 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh\r\nYQ==' {code} The former decodes fine with the {{base64}} on my machine but the latter does not: {code} $ pbpaste | base64 --decode aa% $ pbpaste | base64 --decode base64: stdin: (null): error decoding base64 input stream {code} was: SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} (which is fine but shouldn't happen between minor version). {code:title=Spark 3.2} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo metus ut dui. Morbi purus lectus, accumsan vitae sem vitae, faucibus aliquet quam. Donec euismod, nulla a por ...: ta hendrerit, lorem magna vestibulum nunc, et eleifend quam metus quis purus. ...: ...: Praesent id velit scelerisque, varius eros ac, cursus quam. Duis mollis facilisis ante a dictum. Nunc nisl sem, fermentum non sagittis non, convallis nec lectus. Praesent nec nulla sed velit interdum tristique sit amet non nisl. Pellentesque rhoncus libero urna, eget condimentum orci tristique in. Donec a felis eu nisl laoreet efficitur. Integer velit ju ...: sto, elementum a faucibus ac, fringilla ac nibh. 
...: """ In [2]: spark.sql(f"""SELECT base64('{lorem}') AS base64""").collect()[0][0] Out[2]: 'CkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0LCBjb25zZWN0ZXR1ciBhZGlwaXNjaW5nIGVsaXQuIE51bmMgYWMgbGFvcmVldCBtZXR1cy4gQ3VyYWJpdHVyIHNvbGxpY2l0dWRpbiBtYWduYSBhYyBsYWNpbmlhIG9ybmFyZS4gUGVsbGVudGVzcXVlIHNlbXBlciBlbGl0IG51bmMsIHZlc3RpYnVsdW0gdWx0cmljaWVzIGVsaXQgYmliZW5kdW0gc2VkLiBQcmFlc2VudCB2ZWhpY3VsYSBzb2RhbGVzIG9kaW8sIHRpbmNpZHVudCBsYW9yZWV0IGRpYW0gbGFvcmVldCBub24uIE1hdXJpcyBjb25kaW1lbnR1bSBsYWNpbmlhIGxhb3JlZXQuIE1hdXJpcyB1bHRyaWNlcyB1cm5hIHV0IHNhcGllbiBkaWN0dW0gY29tbW9kbyBmYXVjaWJ1cyBuZWMgbmlzbC4gTnVsbGEgbWF0dGlzIHRpbmNpZHVudCBvcmNpIGVnZXQgc2VtcGVyLiBFdGlhbSBkaWduaXNzaW0gZmluaWJ1cyBtaSBldCBsYWNpbmlhLiBDdXJhYml0dXIgdml0YWUgc2VtIGNvbW1vZG8sIGV1aXNtb2QgbmlzbCBhdCwgbW9sZXN0aWUgdG9ydG9yLiBRdWlzcXVlIG9ybmFyZSwgdG9ydG9yIGEgdnVscHV0YXRlIG1vbGVzdGllLCBhdWd1ZSBsZWN0dXMgYmxhbmRpdCBlcmF0LCBuZWMgZWZmaWNpdHVyIGp1c3RvIG1ldHVzIHV0IGR1aS4gTW9yYmkgcHVydXMgbGVjdHVzLCBhY2N1bXNhbiB2aXRhZSBzZW0gdml0YWUsIGZhdWNpYnVzIGFsaXF1ZXQgcXVhbS4gRG9uZWMgZXVpc21vZCwgbnVsbGEgYSBwb3J0YSBoZW5kcmVyaXQsIGxvcmVtIG1hZ25hIHZlc3RpYnVsdW0gbnVuYywgZXQgZWxlaWZlbmQgcXVhbSBtZXR1cyBxdWlzIHB1cnVzLgoKUHJhZXNlbnQgaWQgdmVsaXQgc2NlbGVyaXNxdWUsIHZhcml1cyBlcm9zIGFjLCBjdXJzdXMgcXVhbS4gRHVpcyBtb2xsaXMgZmFjaWxpc2lzIGFudGUgYSBkaWN0dW0uIE51bmMgbmlzbCBzZW0sIGZlcm1lbnR1bSBub24gc2FnaXR0aXMgbm9uLCBjb252YWxsaXMgbmVjIGxlY3R1cy4gUHJhZXNlbnQgbmVjIG51bGxhIHNlZCB2ZWxpdCBpbnRlcmR1bSB0cmlzdGlxdWUgc2l0IGFtZXQgbm9uIG5pc2wuIFBlbGxlbnRlc3F1ZSByaG9uY3VzIGxpYmVybyB1cm5hLCBlZ2V0IGNvbmRpbWVudHVtIG9yY2kgdHJpc3RpcXVlIGluLiBEb25lYyBhIGZlbGlzIGV1IG5pc2wgbGFvcmVldCBlZmZpY2l0dXIuIEludGVnZXIgdmVsaXQganVzdG8sIGVsZW1lbnR1bSBhIGZhdWNpYnVzIGFjLCBmcmluZ2lsbGEgYWMgbmliaC4K' {code} Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). {code:title=Spark 3.3} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo met
[jira] [Created] (SPARK-47307) Spark 3.3 breaks base64
Willi Raschkowski created SPARK-47307: - Summary: Spark 3.3 breaks base64 Key: SPARK-47307 URL: https://issues.apache.org/jira/browse/SPARK-47307 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Willi Raschkowski SPARK-37820 was introduced in Spark 3.3 and breaks behavior of {{base64}} (which is fine but shouldn't happen between minor version). {code:title=Spark 3.2} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo metus ut dui. Morbi purus lectus, accumsan vitae sem vitae, faucibus aliquet quam. Donec euismod, nulla a por ...: ta hendrerit, lorem magna vestibulum nunc, et eleifend quam metus quis purus. ...: ...: Praesent id velit scelerisque, varius eros ac, cursus quam. Duis mollis facilisis ante a dictum. Nunc nisl sem, fermentum non sagittis non, convallis nec lectus. Praesent nec nulla sed velit interdum tristique sit amet non nisl. Pellentesque rhoncus libero urna, eget condimentum orci tristique in. Donec a felis eu nisl laoreet efficitur. Integer velit ju ...: sto, elementum a faucibus ac, fringilla ac nibh. ...: """ In [2]: spark.sql(f"""SELECT base64('{lorem}') AS base64""").collect()[0][0] Out[2]: 'CkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0LCBjb25zZWN0ZXR1ciBhZGlwaXNjaW5nIGVsaXQuIE51bmMgYWMgbGFvcmVldCBtZXR1cy4gQ3VyYWJpdHVyIHNvbGxpY2l0dWRpbiBtYWduYSBhYyBsYWNpbmlhIG9ybmFyZS4gUGVsbGVudGVzcXVlIHNlbXBlciBlbGl0IG51bmMsIHZlc3RpYnVsdW0gdWx0cmljaWVzIGVsaXQgYmliZW5kdW0gc2VkLiBQcmFlc2VudCB2ZWhpY3VsYSBzb2RhbGVzIG9kaW8sIHRpbmNpZHVudCBsYW9yZWV0IGRpYW0gbGFvcmVldCBub24uIE1hdXJpcyBjb25kaW1lbnR1bSBsYWNpbmlhIGxhb3JlZXQuIE1hdXJpcyB1bHRyaWNlcyB1cm5hIHV0IHNhcGllbiBkaWN0dW0gY29tbW9kbyBmYXVjaWJ1cyBuZWMgbmlzbC4gTnVsbGEgbWF0dGlzIHRpbmNpZHVudCBvcmNpIGVnZXQgc2VtcGVyLiBFdGlhbSBkaWduaXNzaW0gZmluaWJ1cyBtaSBldCBsYWNpbmlhLiBDdXJhYml0dXIgdml0YWUgc2VtIGNvbW1vZG8sIGV1aXNtb2QgbmlzbCBhdCwgbW9sZXN0aWUgdG9ydG9yLiBRdWlzcXVlIG9ybmFyZSwgdG9ydG9yIGEgdnVscHV0YXRlIG1vbGVzdGllLCBhdWd1ZSBsZWN0dXMgYmxhbmRpdCBlcmF0LCBuZWMgZWZmaWNpdHVyIGp1c3RvIG1ldHVzIHV0IGR1aS4gTW9yYmkgcHVydXMgbGVjdHVzLCBhY2N1bXNhbiB2aXRhZSBzZW0gdml0YWUsIGZhdWNpYnVzIGFsaXF1ZXQgcXVhbS4gRG9uZWMgZXVpc21vZCwgbnVsbGEgYSBwb3J0YSBoZW5kcmVyaXQsIGxvcmVtIG1hZ25hIHZlc3RpYnVsdW0gbnVuYywgZXQgZWxlaWZlbmQgcXVhbSBtZXR1cyBxdWlzIHB1cnVzLgoKUHJhZXNlbnQgaWQgdmVsaXQgc2NlbGVyaXNxdWUsIHZhcml1cyBlcm9zIGFjLCBjdXJzdXMgcXVhbS4gRHVpcyBtb2xsaXMgZmFjaWxpc2lzIGFudGUgYSBkaWN0dW0uIE51bmMgbmlzbCBzZW0sIGZlcm1lbnR1bSBub24gc2FnaXR0aXMgbm9uLCBjb252YWxsaXMgbmVjIGxlY3R1cy4gUHJhZXNlbnQgbmVjIG51bGxhIHNlZCB2ZWxpdCBpbnRlcmR1bSB0cmlzdGlxdWUgc2l0IGFtZXQgbm9uIG5pc2wuIFBlbGxlbnRlc3F1ZSByaG9uY3VzIGxpYmVybyB1cm5hLCBlZ2V0IGNvbmRpbWVudHVtIG9yY2kgdHJpc3RpcXVlIGluLiBEb25lYyBhIGZlbGlzIGV1IG5pc2wgbGFvcmVldCBlZmZpY2l0dXIuIEludGVnZXIgdmVsaXQganVzdG8sIGVsZW1lbnR1bSBhIGZhdWNpYnVzIGFjLCBmcmluZ2lsbGEgYWMgbmliaC4K' {code} Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines). {code:title=Spark 3.3} In [1]: lorem = """ ...: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac laoreet metus. 
Curabitur sollicitudin magna ac lacinia ornare. Pellentesque semper elit nunc, vestibulum ultricies elit bibendum sed. Praesent vehicula sodales odio, tincidunt laoreet diam laoreet non. Mauris condimentum lacinia laoreet. Mauris ultrices urna ut sapien dictum commodo faucibu ...: s nec nisl. Nulla mattis tincidunt orci eget semper. Etiam dignissim finibus mi et lacinia. Curabitur vitae sem commodo, euismod nisl at, molestie tortor. Quisque ornare, tortor a vulputate molestie, augue lectus blandit erat, nec efficitur justo metus ut dui. Morbi purus lectus, accumsan vitae sem vitae, faucibus aliquet quam. Donec euismod, nulla a por ...: ta hendrerit, lorem magna vestibulum nunc, et eleifend quam metus quis purus. ...: ...: Praesent id velit scelerisque, varius eros ac, cursus quam. Duis mollis facilisis ante a dictum. Nunc nisl sem, fermentum non sagittis non, convallis nec lectus. Praesent nec nulla sed velit interdum tristique sit amet non nisl. Pellentesque rhoncus libero urna, eget condimentum orci tristique in. Donec a felis eu nisl laoreet efficitur. Integer velit ju ...: sto, elementum a faucibus ac, fringilla ac nibh. ...: """ In [2]: spark.sql(f"""SELECT base64('{lorem}') AS base64""").collect()[0][0]
[jira] [Comment Edited] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811850#comment-17811850 ] Willi Raschkowski edited comment on SPARK-46893 at 1/29/24 12:15 PM: - cc [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. was (Author: raschkowski): [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Commented] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811850#comment-17811850 ] Willi Raschkowski commented on SPARK-46893: --- [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Description: Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} handlers) in the UI job and stage descriptions. The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Description: Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} handlers) in the UI job and stage descriptions. The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Attachment: Screenshot 2024-01-29 at 09.06.34.png > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Summary: Remove inline scripts from UI descriptions (was: Sanitize UI descriptions from inline scripts) > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Created] (SPARK-46893) Sanitize UI descriptions from inline scripts
Willi Raschkowski created SPARK-46893: - Summary: Sanitize UI descriptions from inline scripts Key: SPARK-46893 URL: https://issues.apache.org/jira/browse/SPARK-46893 Project: Spark Issue Type: Bug Components: UI, Web UI Affects Versions: 3.4.1 Reporter: Willi Raschkowski Attachments: Screen Recording 2024-01-28 at 17.51.47.mov Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} handlers) in the UI job and stage descriptions. The UI already has precaution to treat, e.g., {{
[jira] [Updated] (SPARK-46893) Sanitize UI descriptions from inline scripts
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Attachment: Screen Recording 2024-01-28 at 17.51.47.mov > Sanitize UI descriptions from inline scripts > > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g., {{
[jira] [Commented] (SPARK-44767) Plugin API for PySpark and SparkR workers
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788360#comment-17788360 ] Willi Raschkowski commented on SPARK-44767: --- [~gurwls223], curious what you think about this proposal. I know you're leaning towards dynamic environment selection for Spark Connect [(apache/spark#41215)|https://github.com/apache/spark/pull/41215] instead of relying on a single environment per Spark application or per host. At Palantir, we use conda-pack based environments with {{spark.archives}}. But that wasn't sufficient to make native library dependencies work. Internally, we implemented a {{ProcessBuilder}} plugin (using the [proposed API|https://github.com/apache/spark/pull/42440]). Among other things we use it to append the environment's {{bin/}} to the process' {{PATH}} variable or to discover Python module and non-Python binary locations outside the packaged environment. > Plugin API for PySpark and SparkR workers > - > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. > But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
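To make that concrete, the kind of hook being described could look roughly like the sketch below. This is a hypothetical illustration only, not the API proposed in apache/spark#42440; the trait name and signature are made up. The point is just that handing the worker's {{ProcessBuilder}} to a plugin lets a deployment prepend the unpacked environment's {{bin/}} to {{PATH}} before the Python or R worker is launched.

{code}
// Hypothetical sketch, not the API from apache/spark#42440.
import java.io.File

trait WorkerProcessPlugin {
  def configure(builder: ProcessBuilder): ProcessBuilder
}

class PrependEnvironmentBin(environmentDir: File) extends WorkerProcessPlugin {
  override def configure(builder: ProcessBuilder): ProcessBuilder = {
    val env = builder.environment()
    val binDir = new File(environmentDir, "bin").getAbsolutePath
    // Put the environment's bin/ ahead of whatever PATH the host pre-configured.
    env.put("PATH", binDir + File.pathSeparator + Option(env.get("PATH")).getOrElse(""))
    builder
  }
}
{code}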
[jira] [Commented] (SPARK-44767) Plugin API for PySpark and SparkR workers
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752926#comment-17752926 ] Willi Raschkowski commented on SPARK-44767: --- I put up a proposal implementation here: https://github.com/apache/spark/pull/42440 > Plugin API for PySpark and SparkR workers > - > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. > But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44767) Plugin API for PySpark and SparkR workers
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-44767: -- Summary: Plugin API for PySpark and SparkR workers (was: Plugin API for PySpark and SparkR subprocesses) > Plugin API for PySpark and SparkR workers > - > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. > But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44767) Plugin API for PySpark and SparkR subprocesses
Willi Raschkowski created SPARK-44767: - Summary: Plugin API for PySpark and SparkR subprocesses Key: SPARK-44767 URL: https://issues.apache.org/jira/browse/SPARK-44767 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.4.1 Reporter: Willi Raschkowski An API to customize Python and R workers allows for extensibility beyond what can be expressed via static configs and environment variables like, e.g., {{spark.pyspark.python}}. A use case we had for this is overriding {{PATH}} when using {{spark.archives}} with, say, conda-pack (as documented [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). Some packages rely on binaries. And if we want to use those packages in Spark, we need to include their binaries in the {{PATH}}. But we can't set the {{PATH}} via some config because 1) the environment with its binaries may be at a dynamic location (archives are unpacked on the driver [into a directory with random name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), and 2) we may not want to override the {{PATH}} that's pre-configured on the hosts. Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
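The "dynamic location" problem in the description is visible directly through {{SparkFiles}}: the directory that {{spark.archives}} gets unpacked into is only known at runtime, so its {{bin/}} cannot be put on {{PATH}} through a static config. A small hedged sketch (the archive alias below is just an example):

{code:title=spark-shell}
import org.apache.spark.SparkFiles

// Root directory for files/archives added to this application; the name is
// randomized per application, so it cannot be baked into PATH ahead of time.
println(SparkFiles.getRootDirectory())

// Resolved path of a hypothetical archive added as spark.archives=environment.tar.gz#environment
println(SparkFiles.get("environment"))
{code}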
[jira] [Updated] (SPARK-44767) Plugin API for PySpark and SparkR subprocesses
[ https://issues.apache.org/jira/browse/SPARK-44767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-44767: -- Description: An API to customize Python and R workers allows for extensibility beyond what can be expressed via static configs and environment variables like, e.g., {{spark.pyspark.python}}. A use case for this is overriding {{PATH}} when using {{spark.archives}} with, say, conda-pack (as documented [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). Some packages rely on binaries. And if we want to use those packages in Spark, we need to include their binaries in the {{PATH}}. But we can't set the {{PATH}} via some config because 1) the environment with its binaries may be at a dynamic location (archives are unpacked on the driver [into a directory with random name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), and 2) we may not want to override the {{PATH}} that's pre-configured on the hosts. Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream. was: An API to customize Python and R workers allows for extensibility beyond what can be expressed via static configs and environment variables like, e.g., {{spark.pyspark.python}}. A use case we had for this is overriding {{PATH}} when using {{spark.archives}} with, say, conda-pack (as documented [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). Some packages rely on binaries. And if we want to use those packages in Spark, we need to include their binaries in the {{PATH}}. But we can't set the {{PATH}} via some config because 1) the environment with its binaries may be at a dynamic location (archives are unpacked on the driver [into a directory with random name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), and 2) we may not want to override the {{PATH}} that's pre-configured on the hosts. Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream. > Plugin API for PySpark and SparkR subprocesses > -- > > Key: SPARK-44767 > URL: https://issues.apache.org/jira/browse/SPARK-44767 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > > An API to customize Python and R workers allows for extensibility beyond what > can be expressed via static configs and environment variables like, e.g., > {{spark.pyspark.python}}. > A use case for this is overriding {{PATH}} when using {{spark.archives}} > with, say, conda-pack (as documented > [here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]). > Some packages rely on binaries. And if we want to use those packages in > Spark, we need to include their binaries in the {{PATH}}. 
> But we can't set the {{PATH}} via some config because 1) the environment with > its binaries may be at a dynamic location (archives are unpacked on the > driver [into a directory with random > name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]), > and 2) we may not want to override the {{PATH}} that's pre-configured on the > hosts. > Other use cases unlocked by this include overriding the executable > dynamically (e.g., to select a version) or forking/redirecting the worker's > output stream. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712395#comment-17712395 ] Willi Raschkowski commented on SPARK-43142: --- https://github.com/apache/spark/pull/40794 > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712383#comment-17712383 ] Willi Raschkowski commented on SPARK-43142: --- The solution I'd propose is to have {{DslAttr.attr}} return the attribute it's wrapping instead of creating a new attribute. > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712383#comment-17712383 ] Willi Raschkowski edited comment on SPARK-43142 at 4/14/23 1:18 PM: The solution I'd propose is to have {{DslAttr.attr}} return the attribute it's wrapping instead of creating a new attribute. I'll put up a PR. was (Author: raschkowski): The solution I'd propose is to have {{DslAttr.attr}} return the attribute it's wrapping instead of creating a new attribute. > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712382#comment-17712382 ] Willi Raschkowski commented on SPARK-43142: --- Here's what's happening: {{ImplicitOperators}} methods like {{asc}} rely on a call to {{expr}} [(Github)|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L149]. The {{UnresolvedAttribute}} returned by {{.attr}} is implicitly converted to {{DslAttr}}. But {{DslAttr}} does not implement {{expr}} by returning the attribute it's already wrapping. Instead, it only implements how to convert the attribute it's wrapping to a string name [(Github)|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L273-L275]. Returning an attribute for an implicitly wrapped attribute is implemented on the super class {{ImplicitAttribute}} by creating a new {{UnresolvedAttribute}} on the string name return by {{DslAttr}} (the method call {{s}}, [Github|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L278-L280]). The problem is that this string name returned by {{DslAttr}} no longer has the quotes and thus the new {{UnresolvedAttribute}} parses an unquoted identifier. {code} scala> "`col/slash`".attr.name res1: String = col/slash {code} > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Priority: Major > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
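Stripped of the catalyst specifics, the fix direction proposed above can be sketched with simplified stand-in types. These are hypothetical classes, not the real {{dsl}} package internals; the point is only that the implicit wrapper should hand back the attribute it already holds instead of rebuilding one from the unquoted name string.

{code}
// Hypothetical stand-ins, not the real catalyst classes.
case class Attr(nameParts: Seq[String]) {
  // Like UnresolvedAttribute.name, any quoting is gone once the attribute exists.
  def name: String = nameParts.mkString(".")
}

trait ImplicitAttr {
  def s: String
  // Today's behavior: discard the original attribute and rebuild one from the
  // bare name string (in real catalyst this goes back through parsing, which is
  // where the ParseException on "slashed/col" comes from).
  def attr: Attr = Attr(s.split("\\.").toSeq)
}

// Proposed direction: keep the attribute the wrapper was handed in the first place.
class FixedDslAttr(wrapped: Attr) extends ImplicitAttr {
  def s: String = wrapped.name
  override def attr: Attr = wrapped
}
{code}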
[jira] [Created] (SPARK-43142) DSL expressions fail on attribute with special characters
Willi Raschkowski created SPARK-43142: - Summary: DSL expressions fail on attribute with special characters Key: SPARK-43142 URL: https://issues.apache.org/jira/browse/SPARK-43142 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Willi Raschkowski Expressions on implicitly converted attributes fail if the attributes have names containing special characters. They fail even if the attributes are backtick-quoted: {code:java} scala> import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.expressions._ scala> "`slashed/col`".attr res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 'slashed/col scala> "`slashed/col`".attr.asc org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) == SQL == slashed/col ---^^^ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709352#comment-17709352 ] Willi Raschkowski commented on SPARK-35324: --- I understand this better now: * When calling {{SQLConf.get}} on executors, the configs are read from the local properties on the {{{}TaskContext{}}}. The local properties are populated driver-side when scheduling the job, using the properties found in {{{}sparkContext.localProperties{}}}. * For RDD actions like {{{}rdd.count{}}}, nothing moves driver-side SQL configs into the SparkContext's local properties. * For datasets, all actions incl. {{{}dataset.count{}}} are wrapped in a {{withAction}} call [(e.g.)|https://github.com/apache/spark/blob/v3.3.2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3160]. * {{withAction}} wraps the action in {{SQLExecution.withNewExecutionId}}, which in turn wraps it in {{SQLExecution.withSQLConfPropagated}}. This latter method copies SQL configs into the SparkContext's local properties. So in summary, all actions on datasets get wrapped in {{withSQLConfPropagated}} while actions on RDDs aren't. That's why {{df.count}} works but {{df.rdd.count}} doesn't. With {{count}} the answer is to just use {{Dataset.count}}. But, e.g., {{df.toLocalIterator}} has no alternative. To fix this, Spark would have to always copy configs into local properties (e.g. in {{submitJob}}). If maintainers like that, I'll put up a PR. If not, feel free to close. In the meantime, my work-around is to call {{SQLExecution.withSQLConfPropagated}} myself. {code} scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") scala> spark.read.schema("date date").option("dateFormat", "MM/dd/yy").csv(Seq("2/6/18").toDS()).toLocalIterator.next 23/04/06 13:58:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] ... scala> SQLExecution.withSQLConfPropagated(spark) { | spark.read.schema("date date").option("dateFormat", "MM/dd/yy").csv(Seq("2/6/18").toDS()).toLocalIterator.next | } res2: org.apache.spark.sql.Row = [2018-02-06] {code} > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.legacy.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on the > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.legacy.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.rdd.collect}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. 
> For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spar
[jira] [Created] (SPARK-42373) Remove unused blank line removal from CSVExprUtils
Willi Raschkowski created SPARK-42373: - Summary: Remove unused blank line removal from CSVExprUtils Key: SPARK-42373 URL: https://issues.apache.org/jira/browse/SPARK-42373 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: Willi Raschkowski The non-multiline CSV read codepath contains references to removal of blank lines throughout. This is not necessary as blank lines are removed by the parser. Furthermore, it causes confusion, indicating that blank lines are removed at this point when instead they are already omitted from the data. The multiline code-path does not explicitly remove blank lines leading to what looks like disparity in behavior between the two. The codepath for {{DataFrameReader.csv(dataset: Dataset[String])}} does need to explicitly skip lines, and this should be respected in {{CSVUtils}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42359) Support row skipping when reading CSV files
[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684689#comment-17684689 ] Willi Raschkowski commented on SPARK-42359: --- This repeats SPARK-26406 but it's worth reconsidering now that SQL / DataFrame APIs established themselves as "preferred" way to interact with Spark and platforms like Databricks SQL increase collaboration with less-technical users. Meanwhile, the RDD and {{zipWithIndex}} workaround is awkward because it implies some ordering that can't be assumed at the RDD-level but the datasource _can_ assume at the CSV-level. > Support row skipping when reading CSV files > --- > > Key: SPARK-42359 > URL: https://issues.apache.org/jira/browse/SPARK-42359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screenshot 2023-02-06 at 13.23.34.png > > > Spark currently can't read CSV files that contain lines with comments or > annotations above the header and data. Work-arounds include pre-processing > CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these > increase friction for less technical users. > This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a > number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
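For reference, the {{zipWithIndex}} workaround mentioned above looks roughly like the sketch below. The path and the number of preamble rows are made up, and it leans on line order being preserved when reading a single file, which is exactly the ordering assumption that makes the workaround awkward at the RDD level.

{code:title=spark-shell}
import spark.implicits._

val linesToSkip = 9L                                        // hypothetical preamble size
val raw = spark.read.textFile("/path/to/excel_export.csv")  // hypothetical path

// Drop the leading annotation rows, then hand the remaining lines to the CSV parser.
val withoutPreamble = raw.rdd
  .zipWithIndex()
  .filter { case (_, index) => index >= linesToSkip }
  .map { case (line, _) => line }
  .toDS()

val df = spark.read
  .option("header", "true")
  .csv(withoutPreamble)
{code}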
[jira] [Commented] (SPARK-42359) Support row skipping when reading CSV files
[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684683#comment-17684683 ] Willi Raschkowski commented on SPARK-42359: --- In our experience such CSV files tend to be Excel exports where users like to populate rows above the header with descriptions of the data. To give a real-world example: [here's a dataset made available by the UK government (data.gov.uk)|https://www.data.gov.uk/dataset/9003012e-4564-4a6b-b5f0-8765ccb23a03/average-road-fuel-sales-deliveries-and-stock-levels]. The dataset is only available via Excel files that look like this: !Screenshot 2023-02-06 at 13.23.34.png! Exporting from Excel for consumption in Spark results in a CSV that looks like this: {code} cat ~/Downloads/20230202_Average_road_fuel_sales_deliveries_and_stock_levels.csv | head -n 15 | cut -c1-150 "Average road fuel deliveries at sampled filling stations: United Kingdom, from 27 January 2020 [note 1][note 2][note 3]" This worksheet contains one table. Some cells refer to notes which can be found in the notes worksheet.,,, "Freeze panes are turned on. To turn off freeze panes select the 'View' ribbon then 'Freeze Panes' then 'Unfreeze Panes' or use [Alt,W,F]" Source: BEIS,, Released: 02 February 2023 Return to contents Units: Volume in litres,,, Date,Weekday,Fuel Type,North East,North West,Yorkshire and The Humber,"East Midlands","West Midlands",East,London,South East,South West,Northern Ireland,Wales,Scotland,"England [note 3]",United Kingdom,, 27/01/2020,Monday,Diesel," 10,583 "," 9,422 "," 11,687 "," 11,205 "," 11,353 "," 10,284 "," 7,501 "," 10,023 "," 9,535 "," 8,511 "," 9,961 "," 9,600 " 28/01/2020,Tuesday,Diesel," 11,643 "," 10,440 "," 13,172 "," 11,885 "," 12,943 "," 12,255 "," 7,310 "," 10,106 "," 11,144 "," 7,740 "," 10,306 "," 10, 29/01/2020,Wednesday,Diesel," 10,839 "," 10,021 "," 11,417 "," 12,195 "," 11,370 "," 12,542 "," 8,102 "," 11,235 "," 10,840 "," 6,943 "," 11,532 "," 9 30/01/2020,Thursday,Diesel," 8,808 "," 10,673 "," 11,871 "," 13,469 "," 12,727 "," 12,445 "," 7,708 "," 11,044 "," 9,741 "," 7,456 "," 10,647 "," 10,2 {code} > Support row skipping when reading CSV files > --- > > Key: SPARK-42359 > URL: https://issues.apache.org/jira/browse/SPARK-42359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screenshot 2023-02-06 at 13.23.34.png > > > Spark currently can't read CSV files that contain lines with comments or > annotations above the header and data. Work-arounds include pre-processing > CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these > increase friction for less technical users. > This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a > number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42359) Support row skipping when reading CSV files
[ https://issues.apache.org/jira/browse/SPARK-42359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-42359: -- Attachment: Screenshot 2023-02-06 at 13.23.34.png > Support row skipping when reading CSV files > --- > > Key: SPARK-42359 > URL: https://issues.apache.org/jira/browse/SPARK-42359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screenshot 2023-02-06 at 13.23.34.png > > > Spark currently can't read CSV files that contain lines with comments or > annotations above the header and data. Work-arounds include pre-processing > CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these > increase friction for less technical users. > This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a > number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42359) Support row skipping when reading CSV files
Willi Raschkowski created SPARK-42359: - Summary: Support row skipping when reading CSV files Key: SPARK-42359 URL: https://issues.apache.org/jira/browse/SPARK-42359 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: Willi Raschkowski Spark currently can't read CSV files that contain lines with comments or annotations above the header and data. Work-arounds include pre-processing CSVs, or using RDDs and something like {{zipWithIndex}}. But all of these increase friction for less technical users. This issue proposes a {{skipLines}} option for Spark's CSV parser to drop a number of unwanted lines at the top of a CSV file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33760) Extend Dynamic Partition Pruning Support to DataSources
[ https://issues.apache.org/jira/browse/SPARK-33760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584882#comment-17584882 ] Willi Raschkowski commented on SPARK-33760: --- Is this related to SPARK-35779? > Extend Dynamic Partition Pruning Support to DataSources > --- > > Key: SPARK-33760 > URL: https://issues.apache.org/jira/browse/SPARK-33760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Anoop Johnson >Priority: Major > > The implementation of Dynamic Partition Pruning (DPP) in Spark is > [specific|https://github.com/apache/spark/blob/fb2e3af4b5d92398d57e61b766466cc7efd9d7cb/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L59-L64] > to HadoopFSRelation. As a result, DPP is not triggered for queries that use > data sources. > The DataSource v2 readers can expose the partition metadata. Can we use this > metadata and extend DPP to work on data sources as well? > Would appreciate thoughts or corner cases we need to handle. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570048#comment-17570048 ] Willi Raschkowski commented on SPARK-39659: --- [~hyukjin.kwon], we talked about this at the conference. Basically Spark somewhat supports running Python from environments. But it doesn't run "conda activate" and the equivalent for other package managers. We could approximate "conda activate" by updating the PATH. Curious if you think this is a general problem. Or if you think Spark shouldn't be solving it. Cluster owners should solve it by setting PATH themselves. The reason we don't do that is that we want the {{PATH}} elements to be absolute locations (to support e.g. {{Popen}} following a {{chdir}}) and for YARN we don't know the working directory location in advance. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > 
self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/app
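A minimal sketch of the PATH-based approximation described in the comment above, assuming the archive was shipped as {{environment.tar.gz#environment}}; the unpack location is a placeholder and none of this is existing Spark code:

{code:scala}
// Roughly what "conda activate" does to PATH, and what Spark would have to
// approximate for an environment unpacked from spark.archives. The unpack
// location below is a placeholder -- on YARN it is only known once the
// container starts, which is why a static cluster-side PATH doesn't work.
import java.io.File

val envRoot = "/placeholder/container-dir/environment"
val activatedPath = envRoot + "/bin" + File.pathSeparator + sys.env.getOrElse("PATH", "")
// Because the entry has to be absolute (a subprocess may chdir after Popen),
// Spark would need to compute it at launch time rather than rely on cluster config.
{code}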
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561661#comment-17561661 ] Willi Raschkowski commented on SPARK-39659: --- Another way could be to add a config like {{spark.(driver|executor).extraPathDirs}} which the PythonRunner (and RRunner) apply to their {{ProcessBuilder}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido > > Searched for executable 'kaleido' on the following system PATH: > /usr/local/sbin > /usr/local/bin > /usr/sbin > /usr/bin > /sbin > /bin > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issue
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:16 AM: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. So we can add the working directory to {{PATH}}. Still, that's inconvenient to do because you need to modify infrastructure, i.e. set {{PATH}} on YARN nodes or in the K8s image. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. So we can add the working directory to {{PATH}}. Still, that's inconvenient to do: 1) Using relative paths in {{PATH}} is fickle. 2) You need to modify infrastructure, i.e. set {{PATH}} on YARN nodes or in the K8s image. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:15 AM: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. So we can add the working directory to {{PATH}}. Still, that's inconvenient to do: 1) Using relative paths in {{PATH}} is fickle. 2) You need to modify infrastructure, i.e. set {{PATH}} on YARN nodes or in the K8s image. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:08 AM: --- Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. -But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}.- Actually, this isn't right. For drivers in k8s cluster mode, the environment is downloaded into the working directory. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. ~But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{{}SparkFiles{}}}.~ Actually, this isn't right. See comment below. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", 
> line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido execu
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 3:06 AM: --- Alternatively, one could update {{PATH}} to point to something like {{{}./environment/bin/{}}}. ~But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{{}SparkFiles{}}}.~ Actually, this isn't right. See comment below. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_165
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561657#comment-17561657 ] Willi Raschkowski commented on SPARK-39659: --- Anyway, wanted to get your thoughts on this. If you think that adding the parent folder of the Python executable is the right move (i.e. for {{./environment/bin/python}} we do {{$PATH:./environment/bin}}, I can put up a PR. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido > > Searched for executable 'kaleido' on the following system PATH: > /usr/local/sbin > /usr/local/bin > /usr/sbin > /usr/bin > /sbin > /bin > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:56 AM: --- The way we solve this in our fork is by doing something like {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} with {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} inside the driver- and executor-side PythonRunners. was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} inside the driver- and executor-side PythonRunners. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base
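For illustration, the {{appendDirToEnvironmentPath}} helper from the comment above can be exercised on its own; the {{ProcessBuilder}} command line here is made up for the example and is not how Spark launches its Python workers:

{code:scala}
// Standalone illustration of the appendDirToEnvironmentPath helper quoted above.
import java.io.File
import java.nio.file.{Path, Paths}

def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = {
  processBuilder.environment().compute("PATH", (_, oldPath) =>
    Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString))
}

val pythonExec = "./environment/bin/python"
val builder    = new ProcessBuilder(pythonExec, "script.py")   // made-up command for the example
appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder)
// builder.environment().get("PATH") now ends with .../environment/bin, so executables
// installed next to the interpreter (like kaleido) become discoverable by the subprocess.
{code}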
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski commented on SPARK-39659: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido > > Searched for executable 'kaleido' on the following system PATH: > /usr/local/sbin > /usr/local/bin > /usr/sbin > /usr/bin > /sbin > /bin > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) -
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561656#comment-17561656 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:55 AM: --- Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. was (Author: raschkowski): Alternatively, one could update {{PATH}} to point to something like {{./environment/bin/}}. But when using k8s, we don't know the location of the environment on driver beforehand, because it's unarchived under {{SparkFiles}}. > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > 
"/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-p
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:54 AM: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} inside the driver- and executor-side PythonRunners. was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_arg
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:53 AM: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code:scala} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code:scala} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop
[jira] [Comment Edited] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski edited comment on SPARK-39659 at 7/2/22 2:53 AM: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code:scala} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} was (Author: raschkowski): The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. 
> {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in _perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoo
[jira] [Commented] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
[ https://issues.apache.org/jira/browse/SPARK-39659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561655#comment-17561655 ] Willi Raschkowski commented on SPARK-39659: --- The way we solve this in our fork is by doing something like {code:scala} /** * Append the directory to the subprocess' PATH environment variable. * * This allows the Python subprocess to find additional executables when the environment * containing those executables was added at runtime (e.g. via sc.addArchive()). */ def appendDirToEnvironmentPath(dir: Path, processBuilder: ProcessBuilder): Unit = { processBuilder.environment().compute("PATH", (_, oldPath) => Option(oldPath).map(_ + File.pathSeparator + dir).getOrElse(dir.toString)) } {code} and {code} PythonUtils.appendDirToEnvironmentPath(Paths.get(pythonExec).toAbsolutePath.getParent, builder) {code} > Add environment bin folder to R/Python subprocess PATH > -- > > Key: SPARK-39659 > URL: https://issues.apache.org/jira/browse/SPARK-39659 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Willi Raschkowski >Priority: Major > > Some Python packages rely on non-Python executables which are usually made > available on the {{PATH}} through something like {{{}conda activate{}}}. > When using Spark with conda-pack environments added via > {{{}spark.archives{}}}, Python packages aren't able to find conda-installed > executables because Spark doesn't update {{{}PATH{}}}. > E.g. > {code:java|title=test.py} > # This only works if kaleido-python can find the conda-installed executable > fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", > color="species") > fig.write_image("figure.png", engine="kaleido") > {code} > and > {code:java} > ./bin/spark-submit --master yarn --deploy-mode cluster --archives > environment.tar.gz#environment --conf > spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py > {code} > will throw > {code:java} > Traceback (most recent call last): > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", > line 7, in > fig.write_image("figure.png", engine="kaleido") > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", > line 3829, in write_image > return pio.write_image(self, *args, **kwargs) > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 267, in write_image > img_data = to_image( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", > line 144, in to_image > img_bytes = scope.transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", > line 153, in transform > response = self._perform_transform( > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 293, in 
_perform_transform > self._ensure_kaleido() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 176, in _ensure_kaleido > proc_args = self._build_proc_args() > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 123, in _build_proc_args > proc_args = [self.executable_path(), self.scope_name] > File > "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", > line 99, in executable_path > raise ValueError( > ValueError: > The kaleido executable is required by the kaleido Python library, but it was > not included > in the Python package and it could not be found on the system PATH. > Searched for included kaleido executable at: > > /
[jira] [Created] (SPARK-39659) Add environment bin folder to R/Python subprocess PATH
Willi Raschkowski created SPARK-39659: - Summary: Add environment bin folder to R/Python subprocess PATH Key: SPARK-39659 URL: https://issues.apache.org/jira/browse/SPARK-39659 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Willi Raschkowski Some Python packages rely on non-Python executables which are usually made available on the {{PATH}} through something like {{{}conda activate{}}}. When using Spark with conda-pack environments added via {{{}spark.archives{}}}, Python packages aren't able to find conda-installed executables because Spark doesn't update {{{}PATH{}}}. E.g. {code:java|title=test.py} # This only works if kaleido-python can find the conda-installed executable fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", color="species") fig.write_image("figure.png", engine="kaleido") {code} and {code:java} ./bin/spark-submit --master yarn --deploy-mode cluster --archives environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python test.py {code} will throw {code:java} Traceback (most recent call last): File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/kaleido-test.py", line 7, in fig.write_image("figure.png", engine="kaleido") File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/basedatatypes.py", line 3829, in write_image return pio.write_image(self, *args, **kwargs) File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 267, in write_image img_data = to_image( File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/plotly/io/_kaleido.py", line 144, in to_image img_bytes = scope.transform( File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/plotly.py", line 153, in transform response = self._perform_transform( File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 293, in _perform_transform self._ensure_kaleido() File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 176, in _ensure_kaleido proc_args = self._build_proc_args() File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 123, in _build_proc_args proc_args = [self.executable_path(), self.scope_name] File "/tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/scopes/base.py", line 99, in executable_path raise ValueError( ValueError: The kaleido executable is required by the kaleido Python library, but it was not included in the Python 
package and it could not be found on the system PATH. Searched for included kaleido executable at: /tmp/hadoop-hadoop/nm-local-dir/usercache/wraschkowski/appcache/application_1656456739406_0012/container_1656456739406_0012_01_01/environment/lib/python3.10/site-packages/kaleido/executable/kaleido Searched for executable 'kaleido' on the following system PATH: /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
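A user-level workaround for SPARK-39659, until Spark itself appends the environment's bin directory, is to patch PATH from inside the job before importing packages that shell out to bundled executables. This is only a sketch under the assumptions of the kaleido repro above (conda-pack archive unpacked as ./environment, PYSPARK_PYTHON pointing into its bin directory); it is not part of the ticket:
{code:python}
import os
import sys

# sys.executable is ./environment/bin/python when PYSPARK_PYTHON points into
# the unpacked conda-pack archive, so its parent is the environment's bin dir.
env_bin = os.path.dirname(sys.executable)
os.environ["PATH"] = env_bin + os.pathsep + os.environ.get("PATH", "")

# Import after PATH is patched; kaleido looks up its executable lazily and
# falls back to a PATH search (see the error above).
import plotly.express as px

fig = px.scatter(px.data.iris(), x="sepal_length", y="sepal_width", color="species")
fig.write_image("figure.png", engine="kaleido")
{code}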
[jira] [Updated] (SPARK-39107) Silent change in regexp_replace's handling of empty strings
[ https://issues.apache.org/jira/browse/SPARK-39107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39107: -- Labels: correctness (was: ) > Silent change in regexp_replace's handling of empty strings > --- > > Key: SPARK-39107 > URL: https://issues.apache.org/jira/browse/SPARK-39107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > > Hi, we just upgraded from 3.0.2 to 3.1.2 and noticed a silent behavior change > that a) seems incorrect, and b) is undocumented in the [migration > guide|https://spark.apache.org/docs/latest/sql-migration-guide.html]: > {code:title=3.0.2} > scala> val df = spark.sql("SELECT '' AS col") > df: org.apache.spark.sql.DataFrame = [col: string] > scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", > "")).show > +---++ > |col|replaced| > +---++ > | | | > +---++ > {code} > {code:title=3.1.2} > scala> val df = spark.sql("SELECT '' AS col") > df: org.apache.spark.sql.DataFrame = [col: string] > scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", > "")).show > +---++ > |col|replaced| > +---++ > | || > +---++ > {code} > Note, the regular expression {{^$}} should match the empty string, but > doesn't in version 3.1. E.g. this is the Java behavior: > {code} > scala> "".replaceAll("^$", ""); > res1: String = > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39107) Silent change in regexp_replace's handling of empty strings
Willi Raschkowski created SPARK-39107: - Summary: Silent change in regexp_replace's handling of empty strings Key: SPARK-39107 URL: https://issues.apache.org/jira/browse/SPARK-39107 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Reporter: Willi Raschkowski Hi, we just upgraded from 3.0.2 to 3.1.2 and noticed a silent behavior change that a) seems incorrect, and b) is undocumented in the [migration guide|https://spark.apache.org/docs/latest/sql-migration-guide.html]: {code:title=3.0.2} scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "")).show +---++ |col|replaced| +---++ | | | +---++ {code} {code:title=3.1.2} scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "")).show +---++ |col|replaced| +---++ | || +---++ {code} Note, the regular expression {{^$}} should match the empty string, but doesn't in version 3.1. E.g. this is the Java behavior: {code} scala> "".replaceAll("^$", ""); res1: String = {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
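For code that has to behave the same on both sides of this change, one possible mitigation (a sketch, not taken from the ticket; the replacement string "EMPTY" is illustrative) is to special-case empty input instead of relying on a pattern that matches the empty string:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT '' AS col")

# On 3.0.x the pattern ^$ matches the empty string and the replacement is
# applied; on 3.1.2 the empty input comes back unchanged, as shown above.
df.withColumn("replaced", F.regexp_replace(F.col("col"), "^$", "EMPTY")).show()

# Version-independent alternative: handle the empty string explicitly.
df.withColumn(
    "replaced",
    F.when(F.col("col") == "", F.lit("EMPTY"))
     .otherwise(F.regexp_replace(F.col("col"), "^$", "EMPTY")),
).show()
{code}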
[jira] [Comment Edited] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529381#comment-17529381 ] Willi Raschkowski edited comment on SPARK-39044 at 4/28/22 11:29 AM: - [~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained reproducer. This reliably fails in prod. But using that same TypedImperativeAggregate with {{observe()}} in local tests works fine. If you have ideas on what to try, I will. (Also happy to share the aggregate, but from the stacktrace I understood the implementation isn't relevant - it's the {{AggregatingAccumulator}} buffer that is {{null}}. Anyway, I attached [^aggregate.scala].) I understand if you close this ticket because you cannot root-cause without a repro. was (Author: raschkowski): [~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained reproducer. This reliably fails in prod. But using that same TypedImperativeAggregate with {{observe()}} in local tests works fine. If you have ideas on what to try, I will. (Also happy to share the aggregate, but from the stacktrace I understood the implementation isn't relevant - it's the {{AggregatingAccumulator}} buffer that is {{{}null{}}}.) I understand if you close this ticket because you cannot root-cause without a repro. > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: aggregate.scala > > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... 
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(D
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Attachment: aggregate.scala > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: aggregate.scala > > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > 
java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) > at > scala.runtime.java8
[jira] [Commented] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529381#comment-17529381 ] Willi Raschkowski commented on SPARK-39044: --- [~hyukjin.kwon], yes I know. But I wasn't able to get a self-contained reproducer. This reliably fails in prod. But using that same TypedImperativeAggregate with {{observe()}} in local tests works fine. If you have ideas on what to try, I will. (Also happy to share the aggregate, but from the stacktrace I understood the implementation isn't relevant - it's the {{AggregatingAccumulator}} buffer that is {{{}null{}}}.) I understand if you close this ticket because you cannot root-cause without a repro. > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > 
org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:94
[jira] [Commented] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528751#comment-17528751 ] Willi Raschkowski commented on SPARK-39044: --- This worked on Spark 3.0. [~beliefer], given we're hitting this in {{withBufferSerialized}}, I think this might be related to SPARK-37203. > AggregatingAccumulator with TypedImperativeAggregate throwing > NullPointerException > -- > > Key: SPARK-39044 > URL: https://issues.apache.org/jira/browse/SPARK-39044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Willi Raschkowski >Priority: Major > > We're using a custom TypedImperativeAggregate inside an > AggregatingAccumulator (via {{observe()}} and get the error below. It looks > like we're trying to serialize an aggregation buffer that hasn't been > initialized yet. > {code} > Caused by: org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > at > org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) > ... > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in > stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: > java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) > at > org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) > at > java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) > at > java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) > at > org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) > at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) > at > java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) > at > java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) > at > org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Description: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:540) ... Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 32) (10.0.134.136 executor 3): java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at 
scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 11 more {code} was: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(Ta
[jira] [Updated] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-39044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-39044: -- Description: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get the error below. It looks like we're trying to serialize an aggregation buffer that hasn't been initialized yet. {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ... 1 more Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 
11 more {code} was: We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get this error below: {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.ru
[jira] [Created] (SPARK-39044) AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException
Willi Raschkowski created SPARK-39044: - Summary: AggregatingAccumulator with TypedImperativeAggregate throwing NullPointerException Key: SPARK-39044 URL: https://issues.apache.org/jira/browse/SPARK-39044 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Willi Raschkowski We're using a custom TypedImperativeAggregate inside an AggregatingAccumulator (via {{observe()}} and get this error below: {code} Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1435) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.base/java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1460) at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:633) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ... 1 more Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:638) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:599) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:621) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1235) at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1137) at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1428) ... 
11 more {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
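For context on how the accumulator gets exercised: observed metrics are attached with {{observe()}}. A minimal PySpark sketch using only built-in aggregate expressions follows; note the {{Observation}} helper is only available in newer PySpark releases, and the custom Scala TypedImperativeAggregate from this ticket would instead be registered on the JVM side, so this sketch does not reproduce the NPE:
{code:python}
from pyspark.sql import SparkSession, Observation
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

obs = Observation("metrics")
observed = spark.range(100).observe(
    obs,
    F.count(F.lit(1)).alias("rows"),  # built-in aggregates; the NPE above
    F.max("id").alias("max_id"),      # involves a custom TypedImperativeAggregate
)                                     # whose buffer was never initialized

# Metrics are collected once an action runs on the observed plan.
observed.write.mode("overwrite").parquet("/tmp/observe-demo")
print(obs.get)
{code}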
[jira] [Updated] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36768: -- Issue Type: Bug (was: Task) > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
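A workaround that usually sidesteps this resolution failure is to alias the file-format relation, so the column reference only needs a single-part qualifier. A sketch (not from the ticket):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(3).toDF("col").write.mode("overwrite").parquet("testdata")

# Qualifying the column with the full parquet.testdata reference fails to
# resolve (see above); qualifying it through a relation alias resolves fine.
spark.sql("SELECT t.col FROM parquet.`testdata` AS t").show()
{code}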
[jira] [Updated] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-38166: -- Description: We're seeing duplicates after running the following {code:java} def compute_shipments(shipments): shipments = shipments.dropDuplicates(["ship_trck_num"]) shipments = shipments.repartition(4) return shipments {code} and observing lost executors (OOMs) and task retries in the repartition stage. We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs - maybe you have ideas. was: We're seeing duplicates after running the following {code} def compute_shipments(shipments): shipments = shipments.dropDuplicates(["ship_trck_num"]) shipments = shipments.repartition(4) return shipments {code} and observing lost executors (OOMs) and task retries in the repartition stage. We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas. > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code:java} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs - maybe you have > ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489579#comment-17489579 ] Willi Raschkowski commented on SPARK-38166: --- Linking SPARK-23207 (which is closed but looks very related) and SPARK-25342 (which is open but I understand would only explain this if we were operating on RDDs). > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs and the > notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489523#comment-17489523 ] Willi Raschkowski commented on SPARK-38166: --- Attaching driver logs: [^driver.log] Notable lines are probably: {code:java} ... INFO [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded). INFO [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) INFO [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead. at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663) ... Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead. at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) ... INFO [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) due to fetch failure INFO [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting failed stages ... {code} > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs and the > notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-38166: -- Attachment: driver.log > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs and the > notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
Willi Raschkowski created SPARK-38166: - Summary: Duplicates after task failure in dropDuplicates and repartition Key: SPARK-38166 URL: https://issues.apache.org/jira/browse/SPARK-38166 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2 Environment: Cluster runs on K8s. AQE is enabled. Reporter: Willi Raschkowski We're seeing duplicates after running the following {code} def compute_shipments(shipments): shipments = shipments.dropDuplicates(["ship_trck_num"]) shipments = shipments.repartition(4) return shipments {code} and observing lost executors (OOMs) and task retries in the repartition stage. We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
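One mitigation sometimes suggested for duplicates that appear after task retries with a round-robin {{repartition()}} is to make the shuffle deterministic, e.g. by repartitioning on the deduplication key. A sketch of the pipeline above with that change (a workaround idea, not a confirmed fix for this ticket):
{code:python}
def compute_shipments(shipments):
    shipments = shipments.dropDuplicates(["ship_trck_num"])
    # repartition(4) alone uses round-robin partitioning, so a retried upstream
    # task can send rows to different partitions than the first attempt did;
    # hash-partitioning on the key keeps the row-to-partition mapping stable.
    shipments = shipments.repartition(4, "ship_trck_num")
    return shipments
{code}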
[jira] [Commented] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449619#comment-17449619 ] Willi Raschkowski commented on SPARK-37465: --- I'll give the pandas bump a shot. > PySpark tests failing on Pandas 0.23 > > > Key: SPARK-37465 > URL: https://issues.apache.org/jira/browse/SPARK-37465 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Willi Raschkowski >Priority: Major > > I was running Spark tests with Pandas {{0.23.4}} and got the error below. The > minimum Pandas version is currently {{0.23.2}} > [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. > Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix > (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222] > in Pandas. > {code:java} > $ python/run-tests --testnames > 'pyspark.pandas.tests.data_type_ops.test_boolean_ops > BooleanOpsTest.test_floordiv' > ... > == > ERROR [5.785s]: test_floordiv > (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) > -- > Traceback (most recent call last): > File > "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", > line 128, in test_floordiv > self.assert_eq(b_pser // b_pser.astype(int), b_psser // > b_psser.astype(int)) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", > line 1069, in wrapper > result = safe_na_op(lvalues, rvalues) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", > line 1033, in safe_na_op > return na_op(lvalues, rvalues) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", > line 1027, in na_op > result = missing.fill_zeros(result, x, y, op_name, fill_zeros) > File > "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", > line 641, in fill_zeros > signs = np.sign(y if name.startswith(('r', '__r')) else x) > TypeError: ufunc 'sign' did not contain a loop with signature matching types > dtype('bool') dtype('bool') > {code} > These are my relevant package versions: > {code:java} > $ conda list | grep -e numpy -e pyarrow -e pandas -e python > # packages in environment at /home/circleci/miniconda/envs/python3: > numpy 1.16.6 py36h0a8e133_3 > numpy-base1.16.6 py36h41b4c56_3 > pandas0.23.4 py36h04863e7_0 > pyarrow 1.0.1 py36h6200943_36_cpuconda-forge > python3.6.12 hcff3b4d_2anaconda > python-dateutil 2.8.1 py_0anaconda > python_abi3.6 1_cp36mconda-forg > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
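The underlying incompatibility can be reproduced without Spark. A pandas-only sketch of the failing operation from the traceback above (series contents are illustrative):
{code:python}
import pandas as pd

b = pd.Series([True, False, True])

# On pandas 0.23.x this raises:
#   TypeError: ufunc 'sign' did not contain a loop with signature matching
#   types dtype('bool') dtype('bool')
# On pandas >= 0.24, which contains the fix linked above, it completes.
print(b // b.astype(int))
{code}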
[jira] [Comment Edited] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449584#comment-17449584 ] Willi Raschkowski edited comment on SPARK-37465 at 11/26/21, 2:13 PM: -- I also noticed that {{CategoricalOpsTest}} fails on pandas 0.25.3 (latest 0.x) and works on 1.x: {code:java} $ conda list | grep pandas pandas0.25.3 py36he6710b0_0 $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest' ... Running tests... -- /home/circleci/project/python/pyspark/context.py:238: FutureWarning: Python 3.6 support is deprecated in Spark 3.2. FutureWarning test_abs (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (2.353s) test_add (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (1.382s) test_and (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.265s) ok (6.569s) alOpsTest) ... test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.514s) test_floordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.910s) test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.143s) test_ge (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.795s) test_gt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.891s) test_invert (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_isnull (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.097s) test_le (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.863s) test_lt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.844s) test_mod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.897s) test_mul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.860s) test_ne (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.405s) test_neg (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_or (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.160s) test_pow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.821s) test_radd (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.081s) test_rand (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.100s) test_rfloordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.083s) test_rmod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.050s) test_rmul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.079s) test_ror (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.095s) test_rpow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rsub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rtruediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... 
ok (0.079s) test_sub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.818s) test_truediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.832s) == FAIL [1.611s]: test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 122, in assertPandasEqual **kwargs File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 1248, in assert_series_equal assert_attr_equal('name', left, right, obj=obj) File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 941, in assert_attr_equal raise_assert_detail(obj, msg, left_attr, right_attr) AssertionError: Series are different Attribute "name" are different [left]: that_numeric_cat [right]: None The above e
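The {{assert_series_equal}} failure above is purely about the {{name}} attribute, so a small pandas-only probe (a guess, not verified here) could confirm whether comparisons on a named categorical Series preserve the result's name differently on 0.25.x vs. 1.x:
{code:python}
# Probe sketch: the series name is taken from the assertion message above; whether the
# name survives the comparison is exactly what differs between the two sides of assert_eq.
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3]), name="that_numeric_cat")
print(pd.__version__, (s == s).name)
{code}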
[jira] [Commented] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449584#comment-17449584 ] Willi Raschkowski commented on SPARK-37465: --- I also noticed another that {{CategoricalOpsTest}} fails on pandas 0.25.3 (latest 0.x) and works on 1.x: {code:java} $ conda list | grep pandas pandas0.25.3 py36he6710b0_0 $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest' ... Running tests... -- /home/circleci/project/python/pyspark/context.py:238: FutureWarning: Python 3.6 support is deprecated in Spark 3.2. FutureWarning test_abs (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (2.353s) test_add (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (1.382s) test_and (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.265s) ok (6.569s) alOpsTest) ... test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.514s) test_floordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.910s) test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.143s) test_ge (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.795s) test_gt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.891s) test_invert (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_isnull (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.097s) test_le (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.863s) test_lt (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (0.844s) test_mod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.897s) test_mul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.860s) test_ne (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... FAIL (1.405s) test_neg (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.044s) test_or (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.160s) test_pow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.821s) test_radd (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.081s) test_rand (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.100s) test_rfloordiv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.083s) test_rmod (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.050s) test_rmul (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.079s) test_ror (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.095s) test_rpow (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rsub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.078s) test_rtruediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... 
ok (0.079s) test_sub (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.818s) test_truediv (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) ... ok (0.832s) == FAIL [1.611s]: test_eq (pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 122, in assertPandasEqual **kwargs File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 1248, in assert_series_equal assert_attr_equal('name', left, right, obj=obj) File "/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py", line 941, in assert_attr_equal raise_assert_detail(obj, msg, left_attr, right_attr) AssertionError: Series are different Attribute "name" are different [left]: that_numeric_cat [right]: None The above exception was the direct cause of the followin
[jira] [Updated] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-37465: -- Description: I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... == ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anaconda python-dateutil 2.8.1 py_0anaconda python_abi3.6 1_cp36mconda-forg {code} was: I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... 
== ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anac
[jira] [Created] (SPARK-37465) PySpark tests failing on Pandas 0.23
Willi Raschkowski created SPARK-37465: - Summary: PySpark tests failing on Pandas 0.23 Key: SPARK-37465 URL: https://issues.apache.org/jira/browse/SPARK-37465 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Willi Raschkowski I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... == ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anaconda python-dateutil 2.8.1 py_0anaconda python_abi3.6 1_cp36mconda-forg {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415631#comment-17415631 ] Willi Raschkowski commented on SPARK-36768: --- In case you wonder why we care or why we can't just re-write our query with an alias: Those queries without aliases are generated and are meant to be compatible with both Spark SQL and another SQL database (where they work). > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415630#comment-17415630 ] Willi Raschkowski commented on SPARK-36768: --- In the debugger I see [on this line|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L227] {{collectMatches}} doesn't produce any matches because {{qualified3Part}} is an empty map. And it seems to be an empty map because the {{"col"}} attribute in this {{AttributeSeq}} has empty qualifiers. On the other hand, if you do {code:sql} SELECT t.col FROM parquet.testdata t {code} the {{"col"}} attribute in the {{AttributeSeq}} has {{"t"}} as a qualifier. And thus we get matches [on this line|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L253] when filtering for the {{"t"}} qualifier. Naively, that makes me wonder why in the {{"parquet.testdata.col"}} case {{"parquet.testdata"}} is not part of the {{"col"}} attribute's qualifier, but when we alias the table the alias is included as qualifier. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
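For anyone hitting this before a fix lands, the aliasing observation above doubles as a workaround. A sketch using the same table reference and {{spark}} session as the repro; whether aliasing is acceptable depends on how your queries are generated (see the other comment on this ticket):
{code:python}
# Aliasing gives the "col" attribute a "t" qualifier, so resolution succeeds.
spark.sql("SELECT t.col FROM parquet.testdata t").show()
# The unaliased, fully qualified form is what currently fails with AnalysisException:
# spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show()
{code}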
[jira] [Commented] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415621#comment-17415621 ] Willi Raschkowski commented on SPARK-36768: --- This also reproduces on master at time of writing. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36768: -- Description: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} The expected behavior is that {{parquet.testdata.col}} is recognized as referring to attribute {{col}} in {{parquet.testdata}} (you'd expect {{AttributeSeq.resolve}} matches [this case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). was: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} The expected behavior is that {{parquet.testdata.col}} is recognized as referring to attribute {{col}} in {{parquet.testdata}}. This also reproduces on master at time of writing. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}} (you'd expect > {{AttributeSeq.resolve}} matches [this > case|https://github.com/apache/spark/blob/b665782f0d3729928be4ca897ec2eb990b714879/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L214-L239]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36768) Cannot resolve attribute with table reference
[ https://issues.apache.org/jira/browse/SPARK-36768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36768: -- Description: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} The expected behavior is that {{parquet.testdata.col}} is recognized as referring to attribute {{col}} in {{parquet.testdata}}. This also reproduces on master at time of writing. was: Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} This also reproduces on master at time of writing. > Cannot resolve attribute with table reference > - > > Key: SPARK-36768 > URL: https://issues.apache.org/jira/browse/SPARK-36768 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.7, 3.0.3, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark seems in some cases unable to resolve attributes that contain > multi-part names where the first parts reference a table. Here's a repro: > {code:python} > >>> spark.range(3).toDF("col").write.parquet("testdata") > # Single name part attribute is fine > >>> spark.sql("SELECT col FROM parquet.testdata").show() > +---+ > |col| > +---+ > | 1| > | 0| > | 2| > +---+ > # Name part with the table reference fails > >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() > AnalysisException: cannot resolve '`parquet.testdata.col`' given input > columns: [col]; line 1 pos 7; > 'Project ['parquet.testdata.col] > +- Relation[col#50L] parquet > {code} > The expected behavior is that {{parquet.testdata.col}} is recognized as > referring to attribute {{col}} in {{parquet.testdata}}. > This also reproduces on master at time of writing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36768) Cannot resolve attribute with table reference
Willi Raschkowski created SPARK-36768: - Summary: Cannot resolve attribute with table reference Key: SPARK-36768 URL: https://issues.apache.org/jira/browse/SPARK-36768 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.1.2, 3.0.3, 2.4.7 Reporter: Willi Raschkowski Spark seems in some cases unable to resolve attributes that contain multi-part names where the first parts reference a table. Here's a repro: {code:python} >>> spark.range(3).toDF("col").write.parquet("testdata") # Single name part attribute is fine >>> spark.sql("SELECT col FROM parquet.testdata").show() +---+ |col| +---+ | 1| | 0| | 2| +---+ # Name part with the table reference fails >>> spark.sql("SELECT parquet.testdata.col FROM parquet.testdata").show() AnalysisException: cannot resolve '`parquet.testdata.col`' given input columns: [col]; line 1 pos 7; 'Project ['parquet.testdata.col] +- Relation[col#50L] parquet {code} This also reproduces on master at time of writing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398610#comment-17398610 ] Willi Raschkowski commented on SPARK-35324: --- [~mgekk], apologies for the direct ping. Do you know who could look at this? Just hoping to get more jobs upgraded to Spark 3. To summarize the issue: As you know some datetime reads/writes/parses in Spark 3 rely on additional configs, e.g. reading pre-1900 timestamps. It seems that even if you set those configs they don't get propagated to RDDs and jobs fail as if the config wasn't set. > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collect
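For completeness, the same repro in PySpark (a sketch; the {{test.csv}} path, options, and schema are taken from the ticket, and the {{spark}} session object is assumed):
{code:python}
# DataFrame-level actions respect the session conf; RDD-level actions on the same
# DataFrame do not and raise SparkUpgradeException, as described in this ticket.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = (spark.read
      .option("header", "true")
      .option("dateFormat", "MM/dd/yy")
      .schema("date date")
      .csv("test.csv"))
df.count()      # works, the date is parsed with the legacy parser
df.rdd.count()  # fails as if the config were unset
{code}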
[jira] [Comment Edited] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski edited comment on SPARK-35324 at 8/13/21, 11:44 AM: -- Here's an example of that: {code:title=Missing config error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} where we're already setting {{...datetimeRebaseModeInWrite: legacy}}. I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. was (Author: raschkowski): Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (
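To make the comment above concrete, this is the kind of session-level setting the error message asks for and that is already in place on our side (a sketch; the actual job code isn't shown in this ticket). The point of this issue is that the RDD-backed path still behaves as if it were unset:
{code:python}
# Config name taken verbatim from the SparkUpgradeException above; setting it on the
# session is respected by DataFrame writes but, per this ticket, not once RDDs are involved.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
{code}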
[jira] [Comment Edited] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski edited comment on SPARK-35324 at 8/13/21, 11:43 AM: -- Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} where we're already setting {{...datetimeRebaseModeInWrite: legacy}}. I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. was (Author: raschkowski): Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, r
[jira] [Comment Edited] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski edited comment on SPARK-35324 at 8/13/21, 11:43 AM: -- Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} where we're already setting {{...rebaseModeInWrite: legacy}}. I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. was (Author: raschkowski): Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted,
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398607#comment-17398607 ] Willi Raschkowski commented on SPARK-35324: --- Here's an example of that: {code:title=Rebase error} Lost task 20.0 in stage 1.0 (TID 1, redacted, executor 1): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:474) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:477) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: writing dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z into Parquet files can be dangerous, as the files may be read by Spark 2.x or legacy versions of Hive later, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during writing, to get maximum interoperability. Or set spark.sql.legacy.parquet.datetimeRebaseModeInWrite to 'CORRECTED' to write the datetime values as it is, if you are 100% sure that the written files will only be read by Spark 3.0+ or other systems that use Proleptic Gregorian calendar. 
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInWrite(DataSourceUtils.scala:143) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$creteDateRebaseFuncInWrite$1(DataSourceUtils.scala:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4(ParquetWriteSupport.scala:169) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$makeWriter$4$adapted(ParquetWriteSupport.scala:168) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$writeFields$1(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeField(ParquetWriteSupport.scala:463) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:451) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:136) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54) at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182) at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286) ... 9 more {code} I appreciate that we can use SQL instead of RDDs for most use cases but in some cases we're still forced to go down to RDDs and since Spark 3 we're effectively blocking those cases from reading / writing old datetimes. > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Typ
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387298#comment-17387298 ] Willi Raschkowski commented on SPARK-36034: --- [~maxgekk], much appreciated, thanks! > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Assignee: Max Gekk >Priority: Blocker > Labels: correctness > Fix For: 3.2.0, 3.1.3, 3.0.4, 3.3.0 > > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") > >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = > >>> '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") > >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#154] Batched: true, DataFilters: > [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376795#comment-17376795 ] Willi Raschkowski edited comment on SPARK-36034 at 7/7/21, 7:00 PM: To show you the metadata of the Parquet files: {code:java|title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:java|title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. Maybe we need an "inverse rebase" before push-down? 
was (Author: raschkowski): To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. Maybe we need an "inverse rebase" before pushing down? > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https:
[jira] [Comment Edited] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376788#comment-17376788 ] Willi Raschkowski edited comment on SPARK-36034 at 7/7/21, 6:58 PM: You can probably guess but just for completeness, we're also seeing this in Spark 2-written files: {code:java} >>> spark.read.parquet("date_written_by_spark2").selectExpr("date", "date = >>> '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark2").filter("date = >>> '0001-01-01'").show() ++ |date| ++ ++ {code} was (Author: raschkowski): You can probably guess but just for completeness, we're also seeing this in Spark 2-written files: {code} >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").filter("date >>> = '0001-01-01'").show() ++ |date| ++ ++ {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") > >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = > >>> '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") > >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = > >>> '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#154] Batched: true, DataFilters: > [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36034: -- Description: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = >>> '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} was: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. 
This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = >>> '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3
[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-36034: -- Description: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = >>> '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = >>> '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} was: We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. 
This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376796#comment-17376796 ] Willi Raschkowski commented on SPARK-36034: --- [~maxgekk], I think you might be most familiar with these code paths? > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#122] Batched: true, DataFilters: > [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376795#comment-17376795 ] Willi Raschkowski edited comment on SPARK-36034 at 7/7/21, 6:54 PM: To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. Maybe we need an "inverse rebase" before pushing down? 
was (Author: raschkowski): To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > P
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376795#comment-17376795 ] Willi Raschkowski commented on SPARK-36034: --- To show you the metadata of the Parquet files: {code:title=Corrected} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_corrected file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_corrected/part-0-77ac86ed-1488-4b1d-882b-6cef698174ac-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-01-01, max: 0001-01-01, num_nulls: 0] {code} {code:title=Legacy} $ parquet-tools meta /Volumes/git/pds/190025/out/date_written_by_spark3_legacy file: file:/Volumes/git/pds/190025/out/date_written_by_spark3_legacy/part-0-2c5b4961-6908-4f6c-b2d8-e706d793aae5-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.version = 3.1.2 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"date","type":"date","nullable":false,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema date:REQUIRED INT32 L:DATE R:0 D:0 row group 1: RC:1 TS:49 OFFSET:4 date: INT32 SNAPPY DO:0 FPO:4 SZ:51/49/0.96 VC:1 ENC:BIT_PACKED,PLAIN ST:[min: 0001-12-30, max: 0001-12-30, num_nulls: 0] {code} Mind how _corrected_ has {{0001-01-01}} as value, while _legacy_ has {{0001-12-30}}. This gives me the feeling that pushing down the {{0001-01-01}} filter doesn't work. > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. 
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#122] Batched: tru
[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
[ https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376788#comment-17376788 ] Willi Raschkowski commented on SPARK-36034: --- You can probably guess but just for completeness, we're also seeing this in Spark 2-written files: {code} >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark2").filter("date >>> = '0001-01-01'").show() ++ |date| ++ ++ {code} > Incorrect datetime filter when reading Parquet files written in legacy mode > --- > > Key: SPARK-36034 > URL: https://issues.apache.org/jira/browse/SPARK-36034 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > We're seeing incorrect date filters on Parquet files written by Spark 2 or by > Spark 3 with legacy rebase mode. > This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): > {code:title=Good (Corrected Mode)} > >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > >>> "CORRECTED") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > +--+ > | date| > +--+ > |0001-01-01| > +--+ > {code} > This is how we get incorrect results in _legacy_ mode, in this case the > filter is dropping rows it shouldn't: > {code:title=Bad (Legacy Mode)} > In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", > "LEGACY") > >>> spark.sql("SELECT DATE '0001-01-01' AS > >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", > >>> "date = '0001-01-01'").show() > +--+---+ > | date|(date = 0001-01-01)| > +--+---+ > |0001-01-01| true| > +--+---+ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").show() > ++ > |date| > ++ > ++ > >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date > >>> = '0001-01-01'").explain() > == Physical Plan == > *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) > +- *(1) ColumnarToRow >+- FileScan parquet [date#122] Batched: true, DataFilters: > [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: > InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], > PartitionFilters: [], PushedFilters: [IsNotNull(date), > EqualTo(date,0001-01-01)], ReadSchema: struct > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode
Willi Raschkowski created SPARK-36034: - Summary: Incorrect datetime filter when reading Parquet files written in legacy mode Key: SPARK-36034 URL: https://issues.apache.org/jira/browse/SPARK-36034 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.1.2 Reporter: Willi Raschkowski We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode. This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2): {code:title=Good (Corrected Mode)} >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", >>> "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() +--+ | date| +--+ |0001-01-01| +--+ {code} This is how we get incorrect results in _legacy_ mode, in this case the filter is dropping rows it shouldn't: {code:title=Bad (Legacy Mode)} In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS >>> date").write.mode("overwrite").parquet("/Volumes/git/pds/190025/out/date_written_by_spark3") >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").selectExpr("date", >>> "date = '0001-01-01'").show() +--+---+ | date|(date = 0001-01-01)| +--+---+ |0001-01-01| true| +--+---+ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").show() ++ |date| ++ ++ >>> spark.read.parquet("/Volumes/git/pds/190025/out/date_written_by_spark3").where("date >>> = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#122) AND (date#122 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#122] Batched: true, DataFilters: [isnotnull(date#122), (date#122 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/pds/190025/out/date_written_by_spark3], PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
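Not discussed in the ticket, but for anyone stuck on an affected version: a possible mitigation (an assumption on my part, not verified here) is to disable Parquet filter pushdown so the predicate is evaluated by Spark after the legacy dates have been rebased on read, rather than by parquet-mr against the raw statistics:

{code:java}
// Workaround sketch (assumption, not verified in this ticket): with pushdown
// disabled, parquet-mr no longer prunes row groups using the un-rebased
// literal; Spark applies the filter after rebasing the dates on read.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()

// Re-enable pushdown afterwards; it matters for scan performance.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
{code}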
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Issue Type: Bug (was: Task) > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark queries fail on unrecognized hints in subqueries. An example to > reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Description: Spark queries fail on unrecognized hints in subqueries. An example to reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} was: Spark queries seem to fail on unrecognized hints in subqueries. An example to reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark queries fail on unrecognized hints in subqueries. An example to > reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Description: Spark queries seem to fail on unrecognized hints in subqueries. An example to reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} was: Spark fails on unrecognized hint in subquery. To reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark queries seem to fail on unrecognized hints in subqueries. An example to > reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery
[ https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35673: -- Description: Spark fails on unrecognized hint in subquery. To reproduce: {code:sql} SELECT /*+ use_hash */ 42; -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() -- 42 SELECT * FROM ( SELECT /*+ use_hash */ 42 ); -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() -- Error in query: unresolved operator 'Project [*]; -- 'Project [*] -- +- SubqueryAlias __auto_generated_subquery_name --+- Project [42 AS 42#2] -- +- OneRowRelation {code} was: Spark fails on unrecognized hint in subquery. To reproduce, try {code:sql} -- This succeeds with warning SELECT /*+ use_hash */ 42; -- This fails SELECT * FROM ( SELECT /*+ use_hash */ 42 ); {code} The first statement gives you {code} 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() 42 {code} while the second statement gives you {code} 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() Error in query: unresolved operator 'Project [*]; 'Project [*] +- SubqueryAlias __auto_generated_subquery_name +- Project [42 AS 42#2] +- OneRowRelation {code} > Spark fails on unrecognized hint in subquery > > > Key: SPARK-35673 > URL: https://issues.apache.org/jira/browse/SPARK-35673 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.1.2 >Reporter: Willi Raschkowski >Priority: Major > > Spark fails on unrecognized hint in subquery. > To reproduce: > {code:sql} > SELECT /*+ use_hash */ 42; > -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- 42 > SELECT * > FROM ( > SELECT /*+ use_hash */ 42 > ); > -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() > -- Error in query: unresolved operator 'Project [*]; > -- 'Project [*] > -- +- SubqueryAlias __auto_generated_subquery_name > --+- Project [42 AS 42#2] > -- +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35673) Spark fails on unrecognized hint in subquery
Willi Raschkowski created SPARK-35673: - Summary: Spark fails on unrecognized hint in subquery Key: SPARK-35673 URL: https://issues.apache.org/jira/browse/SPARK-35673 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.1.2, 3.1.1, 3.0.2 Reporter: Willi Raschkowski Spark fails on unrecognized hint in subquery. To reproduce, try {code:sql} -- This succeeds with warning SELECT /*+ use_hash */ 42; -- This fails SELECT * FROM ( SELECT /*+ use_hash */ 42 ); {code} The first statement gives you {code} 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash() 42 {code} while the second statement gives you {code} 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash() Error in query: unresolved operator 'Project [*]; 'Project [*] +- SubqueryAlias __auto_generated_subquery_name +- Project [42 AS 42#2] +- OneRowRelation {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351728#comment-17351728 ] Willi Raschkowski commented on SPARK-35324: --- We found the same issue with {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}. It seems that RDDs are generally not respecting {{spark.sql.*}} configs? > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org.apache.spark.rdd.RDD.$an
[jira] [Updated] (SPARK-35324) Spark SQL configs not respected in RDDs
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35324: -- Summary: Spark SQL configs not respected in RDDs (was: Time parser policy not respected in RDD) > Spark SQL configs not respected in RDDs > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253) > at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253) >
[jira] [Commented] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339939#comment-17339939 ] Willi Raschkowski commented on SPARK-35324: --- This also reproduces if I launch the shell with {{--conf "spark.sql.legacy.timeParserPolicy=legacy"}}; just to prove that this isn't because I set the config via {{spark.conf.set}}. > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org
[jira] [Commented] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339938#comment-17339938 ] Willi Raschkowski commented on SPARK-35324: --- I think the difference might have to do with the fact that in the RDD case the config isn't in the local properties of the {{TaskContext}}. * Stepping through the debugger, I see that both RDD and Dataset decide on using or not using the legacy date formatter in [{{DateFormatter.getFormatter}}|https://github.com/apache/spark/blob/4fe4b65d9e4017654c93c8f7957ae3edbd270d0b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L161]. * Then in [{{SQLConf.get}}|https://github.com/apache/spark/blob/4fe4b65d9e4017654c93c8f7957ae3edbd270d0b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L172], both cases find a {{TaskContext}} and no {{existingConf}}. So they create a new {{ReadOnlySQLConf}} from the {{TaskContext}} object. * RDD and Dataset code path differ in the local properties they find on the {{TaskContext}} [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/ReadOnlySQLConf.scala#L32]. The Dataset code path has {{spark.sql.legacy.timeParserPolicy}} in the local properties, but the RDD path doesn't. The {{ReadOnlySQLConf}} is created from the local properties, so in the RDD path the resulting config object doesn't have an override for {{spark.sql.legacy.timeParserPolicy}}. Just to show you what I see in the debugger. In both cases we stopped [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/ReadOnlySQLConf.scala#L32]. !Screen Shot 2021-05-06 at 00.35.10.png|width=300! !Screen Shot 2021-05-06 at 00.33.10.png|width=300! > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. 
You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) >
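To make the diagnosis in the comment above observable without a debugger, here is a minimal spark-shell sketch (an editorial sketch assuming the same session and config key as the repro; it is not part of the original report). It checks whether the {{spark.sql.legacy.timeParserPolicy}} override shows up as a {{TaskContext}} local property inside a task started by a plain RDD action:

{code:scala}
import org.apache.spark.TaskContext

// Session-level override, same as in the repro above.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy")

// On the driver, the override is visible through the session conf.
spark.conf.get("spark.sql.legacy.timeParserPolicy") // "legacy"

// Inside a task of a plain RDD job, SQLConf is rebuilt from the TaskContext's
// local properties. If the key was never copied into those properties, the
// lookup below comes back empty and the session override is effectively ignored.
spark.sparkContext
  .parallelize(Seq(1))
  .map { _ =>
    Option(TaskContext.get().getLocalProperty("spark.sql.legacy.timeParserPolicy"))
      .getOrElse("<not set>")
  }
  .collect()
{code}

If the comment's reading of {{ReadOnlySQLConf}} is right, a Dataset action runs inside a SQL execution that copies the session's SQL confs into these local properties, while {{df.rdd.count}} does not, which would match the observed behaviour.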
[jira] [Updated] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35324: -- Attachment: Screen Shot 2021-05-06 at 00.35.10.png Screen Shot 2021-05-06 at 00.33.10.png > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Shot 2021-05-06 at 00.33.10.png, Screen Shot > 2021-05-06 at 00.35.10.png > > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as > you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the > override and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) > at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253) > at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
[jira] [Updated] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-35324: -- Description: When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in actions on the resulting dataframe. But it's ignored in actions on dataframe's RDD. E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to be set to {{LEGACY}}. If you set the config, {{df.collect}} will work as you'd expect. However, {{df.collect.rdd}} will fail because it'll ignore the override and read the config value as {{EXCEPTION}}. For instance: {code:java|title=test.csv} date 2/6/18 {code} {code:java} scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") scala> val df = { | spark.read | .option("header", "true") | .option("dateFormat", "MM/dd/yy") | .schema("date date") | .csv("/Users/wraschkowski/Downloads/test.csv") | } df: org.apache.spark.sql.DataFrame = [date: date] scala> df.show +--+ | date| +--+ |2018-02-06| +--+ scala> df.count res3: Long = 1 scala> df.rdd.count 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string. at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) at org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1866) at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253) at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/jav
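Assuming the diagnosis above, one possible mitigation (a sketch only; {{SQLExecution.withSQLConfPropagated}} is an internal Spark helper, not a stable public API, and this has not been verified against this ticket) is to run the RDD action inside a block that copies the session's SQL confs into the job's local properties:

{code:scala}
import org.apache.spark.sql.execution.SQLExecution

spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy")

val df = spark.read
  .option("header", "true")
  .option("dateFormat", "MM/dd/yy")
  .schema("date date")
  .csv("/Users/wraschkowski/Downloads/test.csv")

// withSQLConfPropagated sets the session's SQL confs as SparkContext local
// properties for the duration of the body, so tasks of the RDD job can see
// the legacy time parser policy when they rebuild SQLConf from the TaskContext.
val rowCount = SQLExecution.withSQLConfPropagated(spark) {
  df.rdd.count()
}
{code}

This only works around the missing propagation; it doesn't change the underlying behaviour reported here.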
[jira] [Comment Edited] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339933#comment-17339933 ] Willi Raschkowski edited comment on SPARK-35324 at 5/5/21, 11:26 PM: - For what it's worth, I only managed to reproduce with a reader. Creating a dataframe from a {{Seq}} works fine: {code:java} scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).collect res7: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]) scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).rdd.collect res8: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]){code} was (Author: raschkowski): For what it's worth, I only managed to reproduce with a reader: {code:java} scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).collect res7: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]) scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).rdd.collect res8: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]){code} > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.count}} will work as you'd > expect. However, {{df.count.rdd}} will fail because it'll ignore the override > and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at
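On the note above that the issue only reproduces with a reader: a quick, non-conclusive way to compare the two cases is to look at the query plans. My assumption (not verified) is that the in-memory {{Seq}} variant gets folded into a local relation evaluated on the driver, where the session override is visible, whereas the CSV variant parses dates inside the scan's executor tasks:

{code:scala}
import org.apache.spark.sql.functions.to_date
import spark.implicits._

// In-memory case: check whether the projection over the local data was
// collapsed by the optimizer (i.e. evaluated on the driver).
Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).explain(true)

// Reader case: the date column is parsed row by row inside the CSV scan's tasks.
spark.read
  .option("header", "true")
  .option("dateFormat", "MM/dd/yy")
  .schema("date date")
  .csv("/Users/wraschkowski/Downloads/test.csv")
  .explain(true)
{code}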
[jira] [Commented] (SPARK-35324) Time parser policy not respected in RDD
[ https://issues.apache.org/jira/browse/SPARK-35324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339933#comment-17339933 ] Willi Raschkowski commented on SPARK-35324: --- For what it's worth, I only managed to reproduce with a reader: {code:java} scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).collect res7: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]) scala> Seq("2/6/18").toDF.withColumn("parsed", to_date($"value", "MM/dd/yy")).rdd.collect res8: Array[org.apache.spark.sql.Row] = Array([2/6/18,2018-02-06]){code} > Time parser policy not respected in RDD > --- > > Key: SPARK-35324 > URL: https://issues.apache.org/jira/browse/SPARK-35324 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Willi Raschkowski >Priority: Major > > When reading a CSV file, {{spark.sql.timeParserPolicy}} is respected in > actions on the resulting dataframe. But it's ignored in actions on > dataframe's RDD. > E.g. say to parse dates in a CSV you need {{spark.sql.timeParserPolicy}} to > be set to {{LEGACY}}. If you set the config, {{df.count}} will work as you'd > expect. However, {{df.count.rdd}} will fail because it'll ignore the override > and read the config value as {{EXCEPTION}}. > For instance: > {code:java|title=test.csv} > date > 2/6/18 > {code} > {code:java} > scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "legacy") > scala> val df = { > | spark.read > | .option("header", "true") > | .option("dateFormat", "MM/dd/yy") > | .schema("date date") > | .csv("/Users/wraschkowski/Downloads/test.csv") > | } > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +--+ > > | date| > +--+ > |2018-02-06| > +--+ > scala> df.count > res3: Long = 1 > scala> df.rdd.count > 21/05/06 00:06:18 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) > org.apache.spark.SparkUpgradeException: You may get a different result due to > the upgrading of Spark 3.0: Fail to parse '2/6/18' in the new parser. You can > set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior > before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime > string. 
> at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150) > at > org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.$anonfun$parse$1(DateFormatter.scala:61) > at > scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.util.Iso8601DateFormatter.parse(DateFormatter.scala:58) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:202) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:238) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:200) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:291) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:254) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:396) > at > org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:400) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at scala.collection.Iterator$$anon$10.hasNext(Iterato