[jira] [Updated] (SPARK-46747) Too Many Shared Locks due to PostgresDialect.getTableExistsQuery - LIMIT 1
[ https://issues.apache.org/jira/browse/SPARK-46747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46747: --- Labels: pull-request-available (was: ) > Too Many Shared Locks due to PostgresDialect.getTableExistsQuery - LIMIT 1 > -- > > Key: SPARK-46747 > URL: https://issues.apache.org/jira/browse/SPARK-46747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, > 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.3, > 3.4.2, 3.3.2, 3.4.0, 3.4.1, 3.5.0, 3.3.4 >Reporter: Bala Bellam >Priority: Critical > Labels: pull-request-available > > +*Background:*+ > PostgresDialect.getTableExistsQuery uses a LIMIT 1 query to check table existence in the database, overriding the default JdbcDialect.getTableExistsQuery, which uses WHERE 1 = 0. > +*Issue:*+ > Because of the LIMIT 1 query pattern, we see a high number of shared locks in PostgreSQL installations where the table being written to has many partitions. Falling back to the default JdbcDialect query, which uses WHERE 1 = 0, is more efficient: it checks table existence without scanning any of the partitions. > The SELECT 1 FROM table LIMIT 1 query can indeed be heavier in certain scenarios, especially with partitioned tables or tables holding a lot of data, as it may take shared locks on all partitions or spend extra planner and execution time determining the quickest way to fetch a single row. > By contrast, SELECT 1 FROM table WHERE 1=0 never tries to read any data, because the WHERE condition is always false. This makes it a lighter operation: it typically only consults the table's metadata to validate that the table exists, without taking locks on the table's data or partitions. > So, considering performance and lock minimization, SELECT 1 FROM table WHERE 1=0 is the better choice when the goal is strictly to check for a table's existence while avoiding potentially heavy operations such as taking shared locks on partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
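A minimal sketch of the existence probe that WHERE 1 = 0 enables. Python's sqlite3 stands in purely for illustration here (Spark's dialect issues the query over JDBC against PostgreSQL): preparing the statement resolves the table name, while the always-false predicate guarantees no rows or partitions are ever read. The function and table names are invented for the sketch.

```python
import sqlite3

def table_exists(conn, table):
    # Probe in the style of the default JdbcDialect: the statement is
    # prepared (which resolves the table from catalog metadata), but the
    # always-false predicate means no data is ever scanned. The probe
    # succeeds iff the table can be resolved.
    try:
        conn.execute(f"SELECT 1 FROM {table} WHERE 1 = 0")
        return True
    except sqlite3.OperationalError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT)")
print(table_exists(conn, "t"))        # True
print(table_exists(conn, "missing"))  # False
```

The same contrast holds conceptually for the LIMIT 1 variant: it would be a valid probe too, but it asks the engine to actually produce a row, which is what drags partition scanning and locking into an otherwise metadata-only check.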
[jira] [Resolved] (SPARK-46905) Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command
[ https://issues.apache.org/jira/browse/SPARK-46905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46905. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44935 [https://github.com/apache/spark/pull/44935] > Add dedicated class to keep column definition instead of StructField in > Create/ReplaceTable command > --- > > Key: SPARK-46905 > URL: https://issues.apache.org/jira/browse/SPARK-46905 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46905) Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command
[ https://issues.apache.org/jira/browse/SPARK-46905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46905: --- Assignee: Wenchen Fan > Add dedicated class to keep column definition instead of StructField in > Create/ReplaceTable command > --- > > Key: SPARK-46905 > URL: https://issues.apache.org/jira/browse/SPARK-46905 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46893. --- Fix Version/s: 3.4.3 3.5.1 4.0.0 Resolution: Fixed Issue resolved by pull request 44933 [https://github.com/apache/spark/pull/44933] > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Assignee: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Fix For: 3.4.3, 3.5.1, 4.0.0 > > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g.,
[jira] [Assigned] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46893: - Assignee: Willi Raschkowski > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Assignee: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png
[jira] [Commented] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812165#comment-17812165 ] Dongjoon Hyun commented on SPARK-46893: --- Thank you for pinging me, [~rshkv]. > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png
[jira] [Updated] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46876: --- Labels: pull-request-available (was: ) > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > Labels: pull-request-available > > When reading a tab-separated file that contains lines consisting only of tabs (i.e. empty strings as the column values for that row), these rows are silently skipped as empty lines and the resulting dataframe has fewer rows than expected. > This behavior is inconsistent with the behavior for, e.g., semicolon-separated files, where the resulting dataframe gets a row containing only empty string values. > A minimal reproducible example: a file containing this > {code:java} > a\tb\tc\r\n > \t\t\r\n > 1\t2\t3{code} > will create a dataframe with one row (a=1,b=2,c=3), > whereas this > {code:java} > a;b;c\r\n > ;;\r\n > 1;2;3{code} > will be read as two rows (the first row contains empty strings). > I used the following pyspark commands to read the dataframes > {code:java} > spark.read.option("header","true").option("sep","\t").csv(" file>").collect() > spark.read.option("header","true").option("sep",";").csv(" file>").collect() > {code} > I ran into this particularly on Databricks (I assume they use the same reader), but [this stack overflow post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858] indicates that this is an old issue that may have been carried over from Databricks when their csv reader was adopted in SPARK-12420. > I recommend at least adding a test case for this to the CSV reader. > > Why this behaviour is a problem: > * It violates some core assumptions: > ** a properly configured roundtrip via csv write/read should result in the same set of rows > ** changing the csv separator (when everything is properly escaped) should have no effect > Potential resolutions: > * When the configured delimiter consists only of whitespace: > ** deactivate the "skip empty lines" feature, > ** or skip only lines that are completely empty (only a (carriage return) newline) > * Change the "skip empty lines" feature to skip only lines that are completely empty (containing only a newline) > ** this may break some user code that relies on the current behaviour -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
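The skip behaviour described above can be sketched with a small predicate. The comments on this issue point at `CSVExprUtils.filterCommentAndEmpty` dropping lines whose characters are all <= ' '; the Python below is a hypothetical stand-in for that rule, not Spark's actual implementation, and only illustrates why a tab-only row is lost while a semicolon-only row survives.

```python
def skipped_as_empty(line: str) -> bool:
    # Hypothetical stand-in for the suspected filter: a line is treated
    # as "empty" (and dropped before parsing) when every character is
    # <= ' ', i.e. whitespace or a control character. A row of bare tab
    # delimiters satisfies this, so its empty-string values are lost.
    stripped = line.rstrip("\r\n")
    return all(ch <= " " for ch in stripped)

print(skipped_as_empty("\t\t"))     # True  -> tab-delimited empty row is dropped
print(skipped_as_empty(";;"))       # False -> semicolon-delimited row survives
print(skipped_as_empty("1\t2\t3"))  # False -> data rows are kept
```

This also suggests why the proposed resolutions work: either disable the predicate when the delimiter itself is whitespace, or tighten the predicate to match only a truly empty line (zero characters after stripping the line ending).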
[jira] [Resolved] (SPARK-46914) Shorten app name in the summary table on the History Page
[ https://issues.apache.org/jira/browse/SPARK-46914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46914. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44944 [https://github.com/apache/spark/pull/44944] > Shorten app name in the summary table on the History Page > -- > > Key: SPARK-46914 > URL: https://issues.apache.org/jira/browse/SPARK-46914 > Project: Spark > Issue Type: Improvement > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46914) Shorten app name in the summary table on the History Page
[ https://issues.apache.org/jira/browse/SPARK-46914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46914: - Assignee: Kent Yao > Shorten app name in the summary table on the History Page > -- > > Key: SPARK-46914 > URL: https://issues.apache.org/jira/browse/SPARK-46914 > Project: Spark > Issue Type: Improvement > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46916) Clean up the imports in pyspark.pandas.tests.indexes.*
[ https://issues.apache.org/jira/browse/SPARK-46916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46916. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44945 [https://github.com/apache/spark/pull/44945] > Clean up the imports in pyspark.pandas.tests.indexes.* > -- > > Key: SPARK-46916 > URL: https://issues.apache.org/jira/browse/SPARK-46916 > Project: Spark > Issue Type: Sub-task > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46916) Clean up the imports in pyspark.pandas.tests.indexes.*
[ https://issues.apache.org/jira/browse/SPARK-46916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46916: --- Labels: pull-request-available (was: ) > Clean up the imports in pyspark.pandas.tests.indexes.* > -- > > Key: SPARK-46916 > URL: https://issues.apache.org/jira/browse/SPARK-46916 > Project: Spark > Issue Type: Sub-task > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46916) Clean up the imports in pyspark.pandas.tests.indexes.*
[ https://issues.apache.org/jira/browse/SPARK-46916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-46916: -- Summary: Clean up the imports in pyspark.pandas.tests.indexes.* (was: [SPARK-46896][PS][TESTS] Clean up the imports in pyspark.pandas.tests.indexes.*) > Clean up the imports in pyspark.pandas.tests.indexes.* > -- > > Key: SPARK-46916 > URL: https://issues.apache.org/jira/browse/SPARK-46916 > Project: Spark > Issue Type: Sub-task > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46914) Shorten app name in the summary table on the History Page
[ https://issues.apache.org/jira/browse/SPARK-46914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-46914: - Priority: Minor (was: Major) > Shorten app name in the summary table on the History Page > -- > > Key: SPARK-46914 > URL: https://issues.apache.org/jira/browse/SPARK-46914 > Project: Spark > Issue Type: Improvement > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46747) Too Many Shared Locks due to PostgresDialect.getTableExistsQuery - LIMIT 1
[ https://issues.apache.org/jira/browse/SPARK-46747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812135#comment-17812135 ] Kent Yao commented on SPARK-46747: -- It would be better if you could provide the stats of # of shared locks before and after. > Too Many Shared Locks due to PostgresDialect.getTableExistsQuery - LIMIT 1 > -- > > Key: SPARK-46747 > URL: https://issues.apache.org/jira/browse/SPARK-46747 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bala Bellam >Priority: Critical -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46914) Shorten app name in the summary table on the History Page
[ https://issues.apache.org/jira/browse/SPARK-46914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46914: --- Labels: pull-request-available (was: ) > Shorten app name in the summary table on the History Page > -- > > Key: SPARK-46914 > URL: https://issues.apache.org/jira/browse/SPARK-46914 > Project: Spark > Issue Type: Improvement > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46912) Spark-submit in cluster mode with standalone cluster uses wrong JAVA_HOME path
[ https://issues.apache.org/jira/browse/SPARK-46912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46912: --- Labels: pull-request-available (was: ) > Spark-submit in cluster mode with standalone cluster uses wrong JAVA_HOME path > -- > > Key: SPARK-46912 > URL: https://issues.apache.org/jira/browse/SPARK-46912 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 3.5.0 >Reporter: Danh Pham >Priority: Major > Labels: pull-request-available > > When running spark-submit against a standalone cluster in cluster mode, the worker machine uses the JAVA_HOME value from the remote (submitting) machine instead of its own. > To reproduce: > * Create a standalone cluster using docker compose, with JAVA_HOME on each worker set differently from the local machine. > * Run spark-submit with deploy-mode cluster. > * Monitor the worker log; the driver prints: DriverRunner: Launch Command: "" "-cp" ... > Reason: > When the Master creates a new driver in the receiveAndReply method, it uses the submitter's environment variables to build the driver description command. Later, when the driver is launched, a new local command is built on the worker, but it still uses the environment variables from the driver description (which came from the submitter). As a result, the java command is built with the submitter's java home path instead of the worker's. > Suggestion: > In the buildLocalCommand method of org.apache.spark.deploy.worker.CommandUtils, replace JAVA_HOME and SPARK_HOME with the worker's values. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
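The suggested fix amounts to an environment merge with worker-side precedence for the two path variables. The real CommandUtils.buildLocalCommand is Scala; the Python sketch below is only a hypothetical illustration of that precedence, and all names in it are invented.

```python
def build_local_env(driver_desc_env: dict, worker_env: dict) -> dict:
    # Hypothetical sketch of the proposed fix: start from the environment
    # carried in the driver description (which originated on the
    # submitter), but let the worker's own JAVA_HOME and SPARK_HOME win,
    # so the launched JVM resolves paths that exist on the worker host.
    env = dict(driver_desc_env)
    for key in ("JAVA_HOME", "SPARK_HOME"):
        if key in worker_env:
            env[key] = worker_env[key]
    return env

submitter_env = {"JAVA_HOME": "/opt/java-on-submitter", "OTHER": "kept"}
worker_env = {"JAVA_HOME": "/usr/lib/jvm/java-17", "SPARK_HOME": "/opt/spark"}
print(build_local_env(submitter_env, worker_env)["JAVA_HOME"])
# -> /usr/lib/jvm/java-17 (the worker's path, not the submitter's)
```

Everything else in the submitter-provided environment is passed through unchanged, which matches the report's observation that only the path variables are wrong on the worker.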
[jira] [Updated] (SPARK-46915) Simplify `UnaryMinus` and align error class
[ https://issues.apache.org/jira/browse/SPARK-46915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46915: --- Labels: pull-request-available (was: ) > Simplify `UnaryMinus` and align error class > --- > > Key: SPARK-46915 > URL: https://issues.apache.org/jira/browse/SPARK-46915 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46915) Simplify `UnaryMinus` and align error class
BingKun Pan created SPARK-46915: --- Summary: Simplify `UnaryMinus` and align error class Key: SPARK-46915 URL: https://issues.apache.org/jira/browse/SPARK-46915 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46914) Shorten app name in the summary table on the History Page
Kent Yao created SPARK-46914: Summary: Shorten app name in the summary table on the History Page Key: SPARK-46914 URL: https://issues.apache.org/jira/browse/SPARK-46914 Project: Spark Issue Type: Improvement Components: UI Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812098#comment-17812098 ] Jie Han edited comment on SPARK-46876 at 1/30/24 3:01 AM: -- {{The reason is that, before parsing the csv lines, Spark calls `CSVExprUtils.filterCommentAndEmpty` to filter "empty" lines, i.e. lines containing only characters <= ' '. I doubt it's necessary to do this, because such lines may be actual data. I've learnt that apache/commons-csv trims every column instead of the whole line before parsing, and trimming is an option.}} was (Author: JIRAUSER285788): {{The reason is that before parsing the csv lines spark calls `CSVExprUtils.filterCommentAndEmpty` to filter `empty` lines which only contains characters those <= ' '. I doubt that if it's neccessary to do this, because they may be exactly data itself. I've learnt that apache/commons-csv does trim for every column instead of whole line before parsing and trim is an option.}} > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812098#comment-17812098 ] Jie Han edited comment on SPARK-46876 at 1/30/24 3:00 AM: -- {{The reason is that before parsing the csv lines spark calls `CSVExprUtils.filterCommentAndEmpty` to filter `empty` lines which only contains characters those <= ' '. I doubt that if it's neccessary to do this, because they may be exactly data itself. I've learnt that apache/commons-csv does trim for every column instead of whole line before parsing and trim is an option.}} was (Author: JIRAUSER285788): {{The reason is that before parsing the csv lines spark calls `CSVExprUtils.filterCommentAndEmpty` to filter `empty` lines which only contains characters those <= ' '. I doubt that if it's neccessary to do this, because they may be exactly data itself. I've learnt that apache/commons-csv does trim for every column instead of whole line before parsing.}} > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812098#comment-17812098 ] Jie Han edited comment on SPARK-46876 at 1/30/24 3:00 AM: -- {{The reason is that before parsing the csv lines spark calls `CSVExprUtils.filterCommentAndEmpty` to filter `empty` lines which only contains characters those <= ' '. I doubt that if it's neccessary to do this, because they may be exactly data itself. I've learnt that apache/commons-csv does trim for every column instead of whole line before parsing.}} was (Author: JIRAUSER285788): {{The reason is that before parsing the csv lines spark calls `CSVExprUtils.filterCommentAndEmpty` to filter `empty` lines which only contains characters those <= ' '. I doubt that if it's neccessary to do this, because they may be exactly data itself. }} > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46912) Spark-submit in cluster mode with standalone cluster uses wrong JAVA_HOME path
Danh Pham created SPARK-46912: - Summary: Spark-submit in cluster mode with standalone cluster uses wrong JAVA_HOME path Key: SPARK-46912 URL: https://issues.apache.org/jira/browse/SPARK-46912 Project: Spark Issue Type: Bug Components: Spark Core, Spark Submit Affects Versions: 3.5.0 Reporter: Danh Pham When running spark-submit against a standalone cluster in cluster mode, the worker machine uses the JAVA_HOME value from the submitting machine instead of its own. To reproduce: * Create a standalone cluster using docker compose, with JAVA_HOME on each worker set differently from the local machine. * Run spark-submit with deploy-mode cluster. * Monitor the worker log; the driver will print: DriverRunner: Launch Command: "" "-cp" ... Reason: When the Master creates a new driver in the receiveAndReply method, it uses the environment variables from the submitter to build the driver description command. Later, when the worker launches the driver, a new local command is built, but it still uses the environment variables from the driver description (which came from the submitter). As a result, the java command is built with the submitter's JAVA_HOME path instead of the worker's. Suggestion: Replace JAVA_HOME and SPARK_HOME in the buildLocalCommand method of org.apache.spark.deploy.worker.CommandUtils with the worker's values. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
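The suggested fix can be sketched as follows. This is plain Python with hypothetical names (the real change would live in `CommandUtils.buildLocalCommand` on the Scala side): machine-specific variables from the driver description are overridden with the worker's own values before the launch command is built.

```python
def build_local_env(driver_desc_env, worker_env):
    """Sketch: start from the submitter-provided environment, then force
    machine-specific variables to the worker's own values."""
    env = dict(driver_desc_env)
    for key in ("JAVA_HOME", "SPARK_HOME"):
        if key in worker_env:
            env[key] = worker_env[key]
        else:
            # avoid launching with a path that doesn't exist on this machine
            env.pop(key, None)
    return env

submitter = {"JAVA_HOME": "/usr/lib/jvm/java-17-local", "SPARK_CONF": "x"}
worker = {"JAVA_HOME": "/opt/java/openjdk"}
print(build_local_env(submitter, worker)["JAVA_HOME"])  # /opt/java/openjdk
```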
[jira] [Comment Edited] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812098#comment-17812098 ] Jie Han edited comment on SPARK-46876 at 1/30/24 2:26 AM: -- {{The reason is that, before parsing the CSV lines, Spark calls `CSVExprUtils.filterCommentAndEmpty` to filter "empty" lines, i.e. lines that contain only characters <= ' '. I doubt whether it's necessary to do this, because such lines may be actual data. }} was (Author: JIRAUSER285788): {{The reason is that before parsing the csv lines spark calls `CSVExprUtils.filterCommentAndEmpty` to filter `empty` lines which contains characters those <= ' '. I doubt that if it's neccessary to do this, because they may be exactly data itself. }} > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > > When reading a tab-separated file that contains lines that only contain tabs > (i.e. empty strings as values of the columns for that row), these rows > will silently be skipped (as empty lines) and the resulting dataframe will > have fewer rows than expected. > This behavior is inconsistent with the behavior for e.g. semicolon-separated > files, where the resulting dataframe will have a row with only empty string > values. 
[jira] [Resolved] (SPARK-46736) Retain empty protobuf message in schema for protobuf connector
[ https://issues.apache.org/jira/browse/SPARK-46736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-46736. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44643 [https://github.com/apache/spark/pull/44643] > Retain empty protobuf message in schema for protobuf connector > -- > > Key: SPARK-46736 > URL: https://issues.apache.org/jira/browse/SPARK-46736 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Since Spark doesn't allow an empty StructType, an empty proto message type as a field > will be dropped by default. Introduce an option to allow retaining an empty > message field by inserting a dummy column. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
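The workaround this option enables can be illustrated with a small sketch (plain Python over a dict-based schema model, not the actual connector code; the field name `__dummy__` is hypothetical): an empty protobuf message maps to an empty struct, which Spark rejects, so a placeholder column is inserted instead of dropping the field.

```python
def struct_for_message(fields, retain_empty=False):
    """Sketch: map a protobuf message's fields to a struct schema.
    Spark disallows empty structs, so either drop the field entirely
    or insert a dummy column when retain_empty is set."""
    if fields:
        return {"type": "struct", "fields": list(fields)}
    if retain_empty:
        # hypothetical placeholder column standing in for the empty message
        return {"type": "struct", "fields": [("__dummy__", "boolean")]}
    return None  # field is dropped from the schema

print(struct_for_message([]))                     # None: empty message dropped
print(struct_for_message([], retain_empty=True))  # struct with one dummy column
```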
[jira] [Assigned] (SPARK-46736) Retain empty protobuf message in schema for protobuf connector
[ https://issues.apache.org/jira/browse/SPARK-46736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-46736: Assignee: Chaoqin Li > Retain empty protobuf message in schema for protobuf connector > -- > > Key: SPARK-46736 > URL: https://issues.apache.org/jira/browse/SPARK-46736 > Project: Spark > Issue Type: Improvement > Components: Protobuf >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > > Since Spark doesn't allow an empty StructType, an empty proto message type as a field > will be dropped by default. Introduce an option to allow retaining an empty > message field by inserting a dummy column. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812098#comment-17812098 ] Jie Han commented on SPARK-46876: - {{The reason is that, before parsing the CSV lines, Spark calls `CSVExprUtils.filterCommentAndEmpty` to filter "empty" lines that contain characters <= ' '. I doubt whether it's necessary to do this, because such lines may be actual data. }} > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > > When reading a tab-separated file that contains lines that only contain tabs > (i.e. empty strings as values of the columns for that row), these rows > will silently be skipped (as empty lines) and the resulting dataframe will > have fewer rows than expected. > This behavior is inconsistent with the behavior for e.g. semicolon-separated > files, where the resulting dataframe will have a row with only empty string > values. 
[jira] [Comment Edited] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812098#comment-17812098 ] Jie Han edited comment on SPARK-46876 at 1/30/24 1:24 AM: -- {{The reason is that, before parsing the CSV lines, Spark calls `CSVExprUtils.filterCommentAndEmpty` to filter "empty" lines that contain characters <= ' '. I doubt whether it's necessary to do this, because such lines may be actual data. }} was (Author: JIRAUSER285788): {{The reason is that before parsing the csv lines spark calls `CSVExprUtils.filterCommentAndEmpty` to filter `empty` lines which contains characters those <= ' '. I doubt that if it's neccessary to do this, because they may be the exactly data itself. }} > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > > When reading a tab-separated file that contains lines that only contain tabs > (i.e. empty strings as values of the columns for that row), these rows > will silently be skipped (as empty lines) and the resulting dataframe will > have fewer rows than expected. > This behavior is inconsistent with the behavior for e.g. semicolon-separated > files, where the resulting dataframe will have a row with only empty string > values. 
[jira] [Resolved] (SPARK-46910) Eliminate JDK Requirement in PySpark Installation
[ https://issues.apache.org/jira/browse/SPARK-46910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46910. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44940 [https://github.com/apache/spark/pull/44940] > Eliminate JDK Requirement in PySpark Installation > - > > Key: SPARK-46910 > URL: https://issues.apache.org/jira/browse/SPARK-46910 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > PySpark requires users to have the correct JDK version (JDK 8+ for Spark<4; > JDK 17+ for Spark>=4) installed locally. > We can make the Spark installation script install the JDK, so users don’t > need to do this step manually. > h1. Details > # When the entry point for a Spark class is invoked, the spark-class script > checks if Java is installed in the user environment. > # If Java is not installed, the user is prompted to select whether they want > to install JDK 17. > # If the user selects yes, JDK 17 is installed (using the [install-jdk > library|https://pypi.org/project/install-jdk/]) and JAVA_HOME variable and > RUNNER are set appropriately. The Spark build will now work! > # If the user selects no, we provide them a brief description of how to > install JDK manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46910) Eliminate JDK Requirement in PySpark Installation
[ https://issues.apache.org/jira/browse/SPARK-46910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46910: Assignee: Amanda Liu > Eliminate JDK Requirement in PySpark Installation > - > > Key: SPARK-46910 > URL: https://issues.apache.org/jira/browse/SPARK-46910 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Minor > Labels: pull-request-available > > PySpark requires users to have the correct JDK version (JDK 8+ for Spark<4; > JDK 17+ for Spark>=4) installed locally. > We can make the Spark installation script install the JDK, so users don’t > need to do this step manually. > h1. Details > # When the entry point for a Spark class is invoked, the spark-class script > checks if Java is installed in the user environment. > # If Java is not installed, the user is prompted to select whether they want > to install JDK 17. > # If the user selects yes, JDK 17 is installed (using the [install-jdk > library|https://pypi.org/project/install-jdk/]) and JAVA_HOME variable and > RUNNER are set appropriately. The Spark build will now work! > # If the user selects no, we provide them a brief description of how to > install JDK manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46911) Add deleteIfExists operator to StatefulProcessorHandle
Eric Marnadi created SPARK-46911: Summary: Add deleteIfExists operator to StatefulProcessorHandle Key: SPARK-46911 URL: https://issues.apache.org/jira/browse/SPARK-46911 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Eric Marnadi Adding the {{deleteIfExists}} method to the {{StatefulProcessorHandle}} in order to remove state variables from the State Store -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
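A minimal sketch of what such a handle-level operation could look like (plain Python with a hypothetical API shape; the real `StatefulProcessorHandle` lives in Spark's Structured Streaming Scala/Python API and is backed by the State Store, not a dict):

```python
class StatefulProcessorHandle:
    """Sketch of a handle that tracks named state variables in a store."""

    def __init__(self):
        self._store = {}

    def get_value_state(self, name, default=None):
        # register (or fetch) a named state variable
        return self._store.setdefault(name, default)

    def delete_if_exists(self, name):
        """Remove the state variable if present; return whether it existed."""
        return self._store.pop(name, None) is not None

h = StatefulProcessorHandle()
h.get_value_state("count", 0)
print(h.delete_if_exists("count"))  # True: variable removed
print(h.delete_if_exists("count"))  # False: already gone
```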
[jira] [Updated] (SPARK-46910) Eliminate JDK Requirement in PySpark Installation
[ https://issues.apache.org/jira/browse/SPARK-46910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46910: --- Labels: pull-request-available (was: ) > Eliminate JDK Requirement in PySpark Installation > - > > Key: SPARK-46910 > URL: https://issues.apache.org/jira/browse/SPARK-46910 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Minor > Labels: pull-request-available > > PySpark requires users to have the correct JDK version (JDK 8+ for Spark<4; > JDK 17+ for Spark>=4) installed locally. > We can make the Spark installation script install the JDK, so users don’t > need to do this step manually. > h1. Details > # When the entry point for a Spark class is invoked, the spark-class script > checks if Java is installed in the user environment. > # If Java is not installed, the user is prompted to select whether they want > to install JDK 17. > # If the user selects yes, JDK 17 is installed (using the [install-jdk > library|https://pypi.org/project/install-jdk/]) and JAVA_HOME variable and > RUNNER are set appropriately. The Spark build will now work! > # If the user selects no, we provide them a brief description of how to > install JDK manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46910) Eliminate JDK Requirement in PySpark Installation
Amanda Liu created SPARK-46910: -- Summary: Eliminate JDK Requirement in PySpark Installation Key: SPARK-46910 URL: https://issues.apache.org/jira/browse/SPARK-46910 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu PySpark requires users to have the correct JDK version (JDK 8+ for Spark<4; JDK 17+ for Spark>=4) installed locally. We can make the Spark installation script install the JDK, so users don’t need to do this step manually. h1. Details # When the entry point for a Spark class is invoked, the spark-class script checks if Java is installed in the user environment. # If Java is not installed, the user is prompted to select whether they want to install JDK 17. # If the user selects yes, JDK 17 is installed (using the [install-jdk library|https://pypi.org/project/install-jdk/]) and JAVA_HOME variable and RUNNER are set appropriately. The Spark build will now work! # If the user selects no, we provide them a brief description of how to install JDK manually. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
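The four steps above can be sketched roughly as follows (plain Python; `install_jdk` stands in for the role of the install-jdk library, and the prompt text and return values are illustrative, not the actual spark-class logic):

```python
import shutil

def ensure_java(prompt=input, install_jdk=None, which=shutil.which):
    """Sketch of the proposed flow: detect java, offer to install JDK 17,
    otherwise point the user at manual installation."""
    if which("java"):
        return "java already available"
    if prompt("Java not found. Install JDK 17? [y/N] ").strip().lower() == "y":
        java_home = install_jdk(version="17")  # hypothetical installer hook
        return f"installed, JAVA_HOME={java_home}"
    return "please install JDK 17 manually and set JAVA_HOME"

# Simulated run where java is missing and the user declines:
print(ensure_java(prompt=lambda _: "n", which=lambda _: None))
# please install JDK 17 manually and set JAVA_HOME
```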
[jira] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890 ] Daniel deleted comment on SPARK-46890: was (Author: JIRAUSER285772): I think this `tokenIndexArr` within Spark's `UnivocityParser` class has different values in the passing and failing cases: {code:java} // This index is used to reorder parsed tokens private val tokenIndexArr = requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray{code} The presence of the default column metadata in the `requiredSchema` is causing the `dataSchema.indexOf` call to fail to match. We can possibly fix this by just stripping the default value metadata from the `requiredSchema` before computing this mapping. > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Labels: pull-request-available > Attachments: image-2024-01-29-13-22-05-326.png > > > When we create a table using CSV on an existing file with a header, where: > - a column has a default, and > - enforceSchema is false (so the CSV header is taken into account), > then querying a column with a default fails. 
> The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
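The index-mismatch hypothesis from the deleted comment can be illustrated in plain Python (a hypothetical field model; the real code is the Scala `tokenIndexArr` in Spark's `UnivocityParser`): when the required schema's fields carry default-value metadata, an equality-based `indexOf` against the data schema (whose fields carry none) finds no match, and stripping the metadata first restores the mapping.

```python
# Fields modelled as (name, type, metadata) tuples; metadata differences
# make naive equality-based lookup fail.
data_schema = [("product_id", "int", {}), ("name", "string", {}),
               ("price", "float", {}), ("quantity", "int", {})]
required = [("price", "float", {"default": "0.0"})]

def token_index(required_schema, data_schema, strip_metadata=False):
    """Mimics tokenIndexArr: map each required field to its position in the
    data schema, returning -1 (like Scala's indexOf) when nothing matches."""
    if strip_metadata:
        required_schema = [(n, t, {}) for n, t, _ in required_schema]
    return [data_schema.index(f) if f in data_schema else -1
            for f in required_schema]

print(token_index(required, data_schema))                       # [-1]: no match
print(token_index(required, data_schema, strip_metadata=True))  # [2]: mapped
```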
[jira] [Updated] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46890: --- Labels: pull-request-available (was: ) > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Labels: pull-request-available > Attachments: image-2024-01-29-13-22-05-326.png > > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812072#comment-17812072 ] Daniel commented on SPARK-46890: [~maxgekk] I created a bug fix here: [https://github.com/apache/spark/pull/44939] > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Labels: pull-request-available > Attachments: image-2024-01-29-13-22-05-326.png > > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46907) Show driver log location in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-46907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46907. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44936 [https://github.com/apache/spark/pull/44936] > Show driver log location in Spark History Server > > > Key: SPARK-46907 > URL: https://issues.apache.org/jira/browse/SPARK-46907 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812059#comment-17812059 ] Daniel edited comment on SPARK-46890 at 1/29/24 9:27 PM: - I think this `tokenIndexArr` within Spark's `UnivocityParser` class has different values in the passing and failing cases: {code:java} // This index is used to reorder parsed tokens private val tokenIndexArr = requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray{code} The presence of the default column metadata in the `requiredSchema` is causing the `dataSchema.indexOf` call to fail to match. We can possibly fix this by just stripping the default value metadata from the `requiredSchema` before computing this mapping. was (Author: JIRAUSER285772): I think this `tokenIndexArr` within Spark's `UnivocityParser` class has different values in the passing and failing cases: {code:java} // This index is used to reorder parsed tokens private val tokenIndexArr = requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray {code} > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Attachments: image-2024-01-29-13-22-05-326.png > > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. 
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812059#comment-17812059 ] Daniel commented on SPARK-46890: I think this `tokenIndexArr` within Spark's `UnivocityParser` class has different values in the passing and failing cases: {code:java} // This index is used to reorder parsed tokens private val tokenIndexArr = requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray {code} > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Attachments: image-2024-01-29-13-22-05-326.png > > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. 
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812058#comment-17812058 ] Daniel commented on SPARK-46890: The bug happens when the Univocity parser is converting the parsed column names to a result array of strings. This `columnsReordered` boolean is true when no column defaults are specified, but erroneously false otherwise: !image-2024-01-29-13-22-05-326.png! [1] https://github.com/apache/spark/blob/528ac8b3e8548a53d931007c36db3427c610f4da/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVHeaderChecker.scala#L127 > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Attachments: image-2024-01-29-13-22-05-326.png > > > When we create a table using CSV on an existing file with a header and: > - a column has a default, and > - enforceSchema is false - taking the CSV header into account, > then querying a column with a default fails.
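The effect of that flag on the length check can be sketched as follows. This is hypothetical standalone code, not Spark's actual CSVHeaderChecker: when columns have been pruned or reordered, comparing the raw header length against the required (pruned) schema size is meaningless and must be skipped; when the flag is erroneously false, a 4-column header gets compared against a 1-field pruned schema and the check throws.

```java
public class HeaderCheckSketch {
    // Illustrative version of the length check described above. In the failing
    // case, columnsReordered is erroneously false, so the full 4-column header
    // is compared against a pruned schema of size 1 and this throws.
    public static void checkHeader(String[] header, String[] requiredSchema,
                                   boolean columnsReordered) {
        if (!columnsReordered && header.length != requiredSchema.length) {
            throw new IllegalArgumentException(
                "Number of columns in CSV header is not equal to number of fields "
                + "in the schema: Header length: " + header.length
                + ", schema size: " + requiredSchema.length);
        }
    }

    public static void main(String[] args) {
        String[] header = {"product_id", "name", "price", "quantity"};
        String[] pruned = {"price"};
        checkHeader(header, pruned, true);       // pruning acknowledged: no exception
        try {
            checkHeader(header, pruned, false);  // the buggy code path
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());  // mismatch reported: 4 vs 1
        }
    }
}
```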
> The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel updated SPARK-46890: --- Attachment: image-2024-01-29-13-22-05-326.png > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > Attachments: image-2024-01-29-13-22-05-326.png > > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46687) Implement memory-profiler
[ https://issues.apache.org/jira/browse/SPARK-46687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-46687. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44775 [https://github.com/apache/spark/pull/44775] > Implement memory-profiler > - > > Key: SPARK-46687 > URL: https://issues.apache.org/jira/browse/SPARK-46687 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812051#comment-17812051 ] Daniel commented on SPARK-46890: The exception comes from here: [https://github.com/apache/spark/blob/c468c3d5c685c5a5ecd7caf01f3004addce1f3b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVHeaderChecker.scala#L91] > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812047#comment-17812047 ] Daniel commented on SPARK-46890: Thanks [~maxgekk] both of the above tests reproduce the bug now. I will debug it. > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46908) Extend SELECT * support outside of select list
[ https://issues.apache.org/jira/browse/SPARK-46908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46908: --- Labels: SQL pull-request-available (was: SQL) > Extend SELECT * support outside of select list > -- > > Key: SPARK-46908 > URL: https://issues.apache.org/jira/browse/SPARK-46908 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Priority: Major > Labels: SQL, pull-request-available > > Traditionally, * is confined to the select list, and there to the top level of > expressions. > Spark does, in an undocumented fashion, support * in function argument lists > within the SELECT list. > Here we want to expand upon this capability by adding the WHERE clause > (Filter) as well as a couple more scenarios such as row value constructors > and the IN operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46909) Could not initialize class org.apache.spark.unsafe.array.ByteArrayMethods error in JDK 21
[ https://issues.apache.org/jira/browse/SPARK-46909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Sohn updated SPARK-46909: Affects Version/s: 3.5.0 > Could not initialize class org.apache.spark.unsafe.array.ByteArrayMethods > error in JDK 21 > - > > Key: SPARK-46909 > URL: https://issues.apache.org/jira/browse/SPARK-46909 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1, 3.5.0 >Reporter: Johnny Sohn >Priority: Major > > Trying to run Spark on JDK 21, and we're getting this exception > {code:java} > Caused by: java.lang.NoClassDefFoundError: Could not initialize class > org.apache.spark.unsafe.array.ByteArrayMethods > at > org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264) > at > org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254) > at > org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:273) > at > org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58) > at > org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320) > at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) > at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:464) > at > org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) > at > Caused by: java.lang.ExceptionInInitializerError: Exception > java.lang.ExceptionInInitializerError [in thread "skir-0"] > at > org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:56) > at > org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264) > at 
> org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254) > at > org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273) > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:273) > at > org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58) > at > org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320) > at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) > at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:464) > at > org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) > at {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46909) Could not initialize class org.apache.spark.unsafe.array.ByteArrayMethods error in JDK 21
Johnny Sohn created SPARK-46909: --- Summary: Could not initialize class org.apache.spark.unsafe.array.ByteArrayMethods error in JDK 21 Key: SPARK-46909 URL: https://issues.apache.org/jira/browse/SPARK-46909 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1 Reporter: Johnny Sohn Trying to run Spark on JDK 21, and we're getting this exception {code:java} Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.unsafe.array.ByteArrayMethods at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264) at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes(MemoryManager.scala:254) at org.apache.spark.memory.MemoryManager.$anonfun$pageSizeBytes$1(MemoryManager.scala:273) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:273) at org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58) at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279) at org.apache.spark.SparkContext.<init>(SparkContext.scala:464) at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) at Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.ExceptionInInitializerError [in thread "skir-0"] at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:56) at org.apache.spark.memory.MemoryManager.defaultPageSizeBytes$lzycompute(MemoryManager.scala:264) at 
scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:273) at org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58) at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:279) at org.apache.spark.SparkContext.<init>(SparkContext.scala:464) at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58) at {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
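The `ByteArrayMethods` static initializer reaches into `sun.misc.Unsafe` through reflection, which newer JDKs block unless the relevant `java.base` packages are opened to unnamed modules. A commonly reported mitigation is to pass `--add-opens` options to both the driver and executors; this is only a sketch under assumptions (the exact flag set varies by Spark version, and 3.3.x predates any official JDK 21 support, so upgrading Spark may be the real fix):

```shell
# Hypothetical spark-submit invocation; the flag list is illustrative of the
# kind of --add-opens set Spark itself applies on newer JDKs, not exhaustive.
JVM_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED \
--add-opens=java.base/java.nio=ALL-UNNAMED \
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"

spark-submit \
  --driver-java-options "$JVM_OPTS" \
  --conf "spark.executor.extraJavaOptions=$JVM_OPTS" \
  app.jar
```

If the flags are absent, the first class to touch `Unsafe` fails its static initialization, and every later use surfaces as the `NoClassDefFoundError: Could not initialize class` seen above.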
[jira] [Comment Edited] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812044#comment-17812044 ] Max Gekk edited comment on SPARK-46890 at 1/29/24 8:29 PM: --- [~dtenedor] Need to trigger the column pruning feature but your query {code:scala} spark.table("Products"),{code} doesn't do that. See my example: {code:sql} spark-sql (default)> SELECT price FROM products; {code} It requests only one column. was (Author: maxgekk): [~dtenedor] Need to trigger the column pruning feature but your query spark.table("Products"), doesn't do that. See my example: spark-sql (default)> SELECT price FROM products; It requests only one column. > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. 
> The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46908) Extend SELECT * support outside of select list
Serge Rielau created SPARK-46908: Summary: Extend SELECT * support outside of select list Key: SPARK-46908 URL: https://issues.apache.org/jira/browse/SPARK-46908 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Serge Rielau Traditionally, * is confined to the select list, and there to the top level of expressions. Spark does, in an undocumented fashion, support * in function argument lists within the SELECT list. Here we want to expand upon this capability by adding the WHERE clause (Filter) as well as a couple more scenarios such as row value constructors and the IN operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
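The direction described above can be illustrated with queries of the following shape. These are hypothetical examples of the proposal, not confirmed final syntax, and the table names `t` and `s` are placeholders:

```sql
-- Already possible today, though undocumented: * inside a function argument list
SELECT struct(*) FROM t;

-- Proposed: * in the WHERE clause (Filter), e.g. inside a function call
SELECT * FROM t WHERE concat_ws(',', *) LIKE '%pending%';

-- Proposed: * in a row value constructor combined with the IN operator
SELECT * FROM t WHERE (*) IN (SELECT * FROM s);
```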
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812044#comment-17812044 ] Max Gekk commented on SPARK-46890: -- [~dtenedor] Need to trigger the column pruning feature but your query spark.table("Products"), doesn't do that. See my example: spark-sql (default)> SELECT price FROM products; It requests only one column. > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812040#comment-17812040 ] Daniel commented on SPARK-46890: I tried the exact command in the bug description, and it doesn't cause any errors on the master branch: {code:java} withTable("Products") { spark.sql( s""" |CREATE TABLE IF NOT EXISTS Products ( | product_id INT, | name STRING, | price FLOAT default 0.0, | quantity INT default 0 |) |USING CSV |OPTIONS ( | header 'true', | inferSchema 'false', | enforceSchema 'false', | path "${testFile(productsFile)}" |) """.stripMargin) checkAnswer( spark.table("Products"), Seq( Row(1, "Apple", 0.50, 100), Row(2, "Banana", 0.25, 200), Row(3, "Orange", 0.75, 50))) } {code} With the "products.csv" file containing: {code:java} product_id,name,price,quantity 1,Apple,0.50,100 2,Banana,0.25,200 3,Orange,0.75,50 {code} This unit test passes. > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. 
> The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812034#comment-17812034 ] Daniel commented on SPARK-46890: This unit test does not seem to reproduce the problem: {code:java} test("SPARK-46890: CSV fails on a column with default and without enforcing schema") { withTable("CarsTable") { spark.sql( s""" |CREATE TABLE CarsTable( | year INT, | make STRING, | model STRING, | comment STRING DEFAULT '', | blank STRING DEFAULT '') |USING csv |OPTIONS ( | header "true", | inferSchema "false", | enforceSchema "false", | path "${testFile(carsFile)}" |) """.stripMargin) checkAnswer( spark.table("CarsTable"), Seq( Row(2012, "Tesla", "S", "No comment", null), Row(1997, "Ford", "E350", "Go get one now they are going fast", null), Row(2015, "Chevy", "Volt", "", "") )) } } {code} With the "cars.csv" file containing: {code:java} year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they are going fast", 2015,Chevy,Volt {code} Will look further. > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. 
> The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812034#comment-17812034 ] Daniel edited comment on SPARK-46890 at 1/29/24 7:31 PM: - This unit test does not seem to reproduce the problem: {code:java} test("SPARK-46890: CSV fails on a column with default and without enforcing schema") { withTable("CarsTable") { spark.sql( s""" |CREATE TABLE CarsTable( | year INT, | make STRING, | model STRING, | comment STRING DEFAULT '', | blank STRING DEFAULT '') |USING csv |OPTIONS ( | header "true", | inferSchema "false", | enforceSchema "false", | path "${testFile(carsFile)}" |) """.stripMargin) checkAnswer( spark.table("CarsTable"), Seq( Row(2012, "Tesla", "S", "No comment", null), Row(1997, "Ford", "E350", "Go get one now they are going fast", null), Row(2015, "Chevy", "Volt", "", "") )) } } {code} With the "cars.csv" file containing: {code:java} year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they are going fast", 2015,Chevy,Volt {code} Will look further. was (Author: JIRAUSER285772): This unit test does not seem to reproduce the problem: {code:java} test("SPARK-46890: CSV fails on a column with default and without enforcing schema") { withTable("CarsTable") { spark.sql( s""" |CREATE TABLE CarsTable( | year INT, | make STRING, | model STRING, | comment STRING DEFAULT '', | blank STRING DEFAULT '') |USING csv |OPTIONS ( | header "true", | inferSchema "false", | enforceSchema "false", | path "${testFile(carsFile)}" |) """.stripMargin) checkAnswer( spark.table("CarsTable"), Seq( Row(2012, "Tesla", "S", "No comment", null), Row(1997, "Ford", "E350", "Go get one now they are going fast", null), Row(2015, "Chevy", "Volt", "", "") )) } } {code} With the "cars.csv" file containing: {code:java} year,make,model,comment,blank "2012","Tesla","S","No comment", 1997,Ford,E350,"Go get one now they are going fast", 2015,Chevy,Volt {code} Will look further. 
> CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46905) Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command
[ https://issues.apache.org/jira/browse/SPARK-46905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46905: --- Labels: pull-request-available (was: ) > Add dedicated class to keep column definition instead of StructField in > Create/ReplaceTable command > --- > > Key: SPARK-46905 > URL: https://issues.apache.org/jira/browse/SPARK-46905 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46907) Show driver log location in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-46907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46907: - Assignee: Dongjoon Hyun > Show driver log location in Spark History Server > > > Key: SPARK-46907 > URL: https://issues.apache.org/jira/browse/SPARK-46907 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46907) Show driver log location in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-46907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46907: --- Labels: pull-request-available (was: ) > Show driver log location in Spark History Server > > > Key: SPARK-46907 > URL: https://issues.apache.org/jira/browse/SPARK-46907 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46907) Show driver log location in Spark History Server
Dongjoon Hyun created SPARK-46907: - Summary: Show driver log location in Spark History Server Key: SPARK-46907 URL: https://issues.apache.org/jira/browse/SPARK-46907 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46890) CSV fails on a column with default and without enforcing schema
[ https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812002#comment-17812002 ] Daniel commented on SPARK-46890: Thanks [~maxgekk] for writing down the details here. It looks like this feature did not take into account the `enforceSchema` option properly. I can take a look. > CSV fails on a column with default and without enforcing schema > --- > > Key: SPARK-46890 > URL: https://issues.apache.org/jira/browse/SPARK-46890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Major > > When we create a table using CSV on an existing file with a header and: > - a column has an default + > - enforceSchema is false - taking into account CSV header > then query a column with a default. > The example below shows the issue: > {code:sql} > CREATE TABLE IF NOT EXISTS products ( > product_id INT, > name STRING, > price FLOAT default 0.0, > quantity INT default 0 > ) > USING CSV > OPTIONS ( > header 'true', > inferSchema 'false', > enforceSchema 'false', > path '/Users/maximgekk/tmp/products.csv' > ); > {code} > The CSV file products.csv: > {code:java} > product_id,name,price,quantity > 1,Apple,0.50,100 > 2,Banana,0.25,200 > 3,Orange,0.75,50 > {code} > The query fails: > {code:sql} > spark-sql (default)> SELECT price FROM products; > 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6) > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > Header length: 4, schema size: 1 > CSV file: file:///Users/maximgekk/tmp/products.csv > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
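The failure quoted in SPARK-46890 above comes down to a length comparison between the parsed CSV header and the schema Spark was asked to produce: {{SELECT price}} prunes the required schema down to one field, but with {{enforceSchema}} set to false the four-column header is still validated against it. The following Python sketch illustrates that failure mode only; it is not Spark's actual CSV validation code, and the helper name is invented:

```python
def check_header(header_columns, schema_fields):
    # Simplified illustration of the header-vs-schema length check that
    # produces the error quoted above; not Spark's real CSV reader code.
    if len(header_columns) != len(schema_fields):
        raise ValueError(
            "Number of column in CSV header is not equal to number of "
            f"fields in the schema: Header length: {len(header_columns)}, "
            f"schema size: {len(schema_fields)}"
        )

header = ["product_id", "name", "price", "quantity"]

# Full-width read: header and schema agree, so the check passes silently.
check_header(header, ["product_id", "name", "price", "quantity"])

# SELECT price: column pruning leaves a one-field schema, yet with
# enforceSchema=false the whole four-column header is still compared
# against it, reproducing the mismatch reported in the ticket.
try:
    check_header(header, ["price"])
except ValueError as err:
    print(err)
```

Under this reading, the fix would be to compare the header against the full file schema (or skip the check) rather than against the pruned per-query schema.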
[jira] [Created] (SPARK-46905) Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command
Wenchen Fan created SPARK-46905: --- Summary: Add dedicated class to keep column definition instead of StructField in Create/ReplaceTable command Key: SPARK-46905 URL: https://issues.apache.org/jira/browse/SPARK-46905 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46904) Fix wrong display of History UI summary
[ https://issues.apache.org/jira/browse/SPARK-46904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46904. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44934 [https://github.com/apache/spark/pull/44934] > Fix wrong display of History UI summary > - > > Key: SPARK-46904 > URL: https://issues.apache.org/jira/browse/SPARK-46904 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46904) Fix wrong display of History UI summary
[ https://issues.apache.org/jira/browse/SPARK-46904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46904: - Assignee: Kent Yao > Fix wrong display of History UI summary > - > > Key: SPARK-46904 > URL: https://issues.apache.org/jira/browse/SPARK-46904 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46904) Fix wrong display of History UI summary
[ https://issues.apache.org/jira/browse/SPARK-46904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46904: -- Parent: SPARK-46001 Issue Type: Sub-task (was: Bug) > Fix wrong display of History UI summary > - > > Key: SPARK-46904 > URL: https://issues.apache.org/jira/browse/SPARK-46904 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811923#comment-17811923 ] Nicholas Chammas commented on SPARK-46810: -- I think Option 3 is a good compromise that lets us continue calling {{INCOMPLETE_TYPE_DEFINITION}} an "error class", which perhaps would be the least disruptive to Spark developers. However, for the record, the SQL standard only seems to use the term "class" in the context of the 5-character SQLSTATE. Otherwise, the standard uses the term "condition" or "exception condition". I don't have a copy of the SQL 2016 standard handy. It's not available on ISO's website for sale, actually. The only option appears to be to purchase [the SQL 2023 standard for ~$220|https://www.iso.org/standard/76583.html]. However, there is a copy of the [SQL 1992 standard available publicly|https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt]. Table 23 on page 619 is relevant:
{code}
Table 23 - SQLSTATE class and subclass values

Condition                Class   Subcondition                Subclass
ambiguous cursor name    3C      (no subclass)               000
cardinality violation    21      (no subclass)               000
connection exception     08      (no subclass)               000
                                 connection does not exist   003
                                 connection failure          006
                                 connection name in use      002
                                 SQL-client unable to        001
                                 establish SQL-connection
...
{code}
I think this maps closest to Option 1, but again if we want to go with Option 3 I think that's reasonable too. But in the case of Option 3 we should then retire [our use of the term "error condition"|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html] so that we don't use multiple terms to refer to the same thing.
> Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? 
> * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1:
[jira] [Assigned] (SPARK-46831) Extend StringType and PhysicalStringType with collation id
[ https://issues.apache.org/jira/browse/SPARK-46831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-46831: Assignee: Aleksandar Tomic > Extend StringType and PhysicalStringType with collation id > -- > > Key: SPARK-46831 > URL: https://issues.apache.org/jira/browse/SPARK-46831 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46831) Extend StringType and PhysicalStringType with collation id
[ https://issues.apache.org/jira/browse/SPARK-46831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-46831. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44901 [https://github.com/apache/spark/pull/44901] > Extend StringType and PhysicalStringType with collation id > -- > > Key: SPARK-46831 > URL: https://issues.apache.org/jira/browse/SPARK-46831 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811858#comment-17811858 ] Max Gekk commented on SPARK-46810: -- [~cloud_fan] [~LuciferYang] [~beliefer] [~dongjoon] WDYT? > Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. 
And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? > * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" > One solution is to use the following terms: > * Error class: 42 > * Error sub-class: K01 > * Error state: 42K01 > * Error condition: INCOMPLETE_TYPE_DEFINITION > * Error sub-condition: ARRAY, MAP, STRUCT > Pros: > * This terminology seems (to me at least) the most natural and intuitive. > * It may also match the SQL standard. > Cons: > * We use {{errorClass}} [all over our > codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] > – literally in thousands of places – to refer to strings like > INCOMPLETE_TYPE_DEFINITION. > ** It's probably not practical to update all these usages to say > {{errorCondition}} instead, so if we go with this approach there will be a > divide between the terminology we use in user-facing documentation vs. what > the code base uses. > ** We can perhaps rename the existing {{error-classes.json}} to > {{error-conditions.json}} but clarify the reason for this divide between code > and user docs in the documentation for {{ErrorClassesJsonReader}} . > h1. Option 2: 42 becomes an "Error Category" > Another approach is to use the following terminology: > * Error category: 42 > * Error sub-category: K01 > * Error state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. 
> * The change from calling "42" a class to a category is low impact and may > not show up in user-facing documentation at all. (See my side note below.) > Cons: > * These terms may not align with the SQL standard. > * We will have to retire the term "error condition", which we have [already > used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] > in user-facing documentation. > — > Side note: In either case, I believe talking about "42" and "K01" – > regardless of what we end up calling them – in front of users is not helpful. > I don't think anybody cares what "42" by itself means, or what "K01" by > itself means. Accordingly, we should limit how much we talk about these > concepts in the user-facing documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811856#comment-17811856 ] Max Gekk commented on SPARK-46810: -- Correct me if I am wrong but the SQL standard says about classes and sub-classes of SQLSTATE not about error classes which I think are different things. What about the option 3: * SQL state class: 42 * SQL state sub-class: K01 * SQL state: 42K01 * Error class: INCOMPLETE_TYPE_DEFINITION * Error sub-classes: ARRAY, MAP, STRUCT > Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". 
But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? > * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" > One solution is to use the following terms: > * Error class: 42 > * Error sub-class: K01 > * Error state: 42K01 > * Error condition: INCOMPLETE_TYPE_DEFINITION > * Error sub-condition: ARRAY, MAP, STRUCT > Pros: > * This terminology seems (to me at least) the most natural and intuitive. > * It may also match the SQL standard. > Cons: > * We use {{errorClass}} [all over our > codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] > – literally in thousands of places – to refer to strings like > INCOMPLETE_TYPE_DEFINITION. > ** It's probably not practical to update all these usages to say > {{errorCondition}} instead, so if we go with this approach there will be a > divide between the terminology we use in user-facing documentation vs. what > the code base uses. > ** We can perhaps rename the existing {{error-classes.json}} to > {{error-conditions.json}} but clarify the reason for this divide between code > and user docs in the documentation for {{ErrorClassesJsonReader}} . > h1. 
Option 2: 42 becomes an "Error Category" > Another approach is to use the following terminology: > * Error category: 42 > * Error sub-category: K01 > * Error state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. > * The change from calling "42" a class to a category is low impact and may > not show up in user-facing documentation at all. (See my side note below.) > Cons: > * These terms may not align with the SQL standard. > * We will have to retire the term "error condition", which we have [already > used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] > in user-facing documentation. > — > Side note: In either case, I believe talking about "42" and "K01" – > regardless of what we end up calling them – in front of users is not helpful. >
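Option 3 in the thread above hinges on the fact that a 5-character SQLSTATE such as {{42K01}} mechanically decomposes into a 2-character class ({{42}}) and a 3-character subclass ({{K01}}), while names like {{INCOMPLETE_TYPE_DEFINITION}} live in a separate, Spark-defined namespace. A small illustrative Python sketch of that decomposition (the split positions follow the SQL standard's SQLSTATE layout; the helper name is invented, not from the Spark codebase):

```python
def split_sqlstate(sqlstate):
    # Split a 5-character SQLSTATE into (class, subclass) per the SQL
    # standard's layout: a 2-character class followed by a 3-character
    # subclass. Illustrative helper only.
    if len(sqlstate) != 5:
        raise ValueError(f"SQLSTATE must be 5 characters, got {sqlstate!r}")
    return sqlstate[:2], sqlstate[2:]

# 42K01 -> SQL state class "42", SQL state sub-class "K01"; the error
# class name INCOMPLETE_TYPE_DEFINITION is a separate identifier entirely.
print(split_sqlstate("42K01"))   # ('42', 'K01')

# "(no subclass)" is conventionally encoded as subclass 000, as in the
# SQL-1992 table quoted earlier in the thread:
print(split_sqlstate("08000"))   # ('08', '000')
```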
[jira] [Comment Edited] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811850#comment-17811850 ] Willi Raschkowski edited comment on SPARK-46893 at 1/29/24 12:15 PM: - cc [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. was (Author: raschkowski): [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g.,
[jira] [Commented] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811850#comment-17811850 ] Willi Raschkowski commented on SPARK-46893: --- [~dongjoon], for your awareness as PMC who's recently touched the UI. I'm wondering if we should file a CVE for this. > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g.,
[jira] [Updated] (SPARK-46904) Fix wrong display of History UI summary
[ https://issues.apache.org/jira/browse/SPARK-46904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46904: --- Labels: pull-request-available (was: ) > Fix wrong display of History UI summary > - > > Key: SPARK-46904 > URL: https://issues.apache.org/jira/browse/SPARK-46904 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46904) Fix wrong display of History UI summary
Kent Yao created SPARK-46904: Summary: Fix wrong display of History UI summary Key: SPARK-46904 URL: https://issues.apache.org/jira/browse/SPARK-46904 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46903) Support Spark History Server Log UI
[ https://issues.apache.org/jira/browse/SPARK-46903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46903. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44932 [https://github.com/apache/spark/pull/44932] > Support Spark History Server Log UI > --- > > Key: SPARK-46903 > URL: https://issues.apache.org/jira/browse/SPARK-46903 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46902) Fix Spark History Server UI for using un-exported setAppLimit
[ https://issues.apache.org/jira/browse/SPARK-46902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46902: -- Summary: Fix Spark History Server UI for using un-exported setAppLimit (was: Fix Spark History Server UI ) > Fix Spark History Server UI for using un-exported setAppLimit > - > > Key: SPARK-46902 > URL: https://issues.apache.org/jira/browse/SPARK-46902 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46902) Fix Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-46902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46902. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44931 [https://github.com/apache/spark/pull/44931] > Fix Spark History Server UI > > > Key: SPARK-46902 > URL: https://issues.apache.org/jira/browse/SPARK-46902 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46902) Fix Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-46902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46902: - Assignee: Kent Yao > Fix Spark History Server UI > > > Key: SPARK-46902 > URL: https://issues.apache.org/jira/browse/SPARK-46902 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46893: --- Labels: pull-request-available (was: ) > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Labels: pull-request-available > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g.,
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Description: Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} handlers) in the UI job and stage descriptions. The UI already has precaution to treat, e.g.,
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Attachment: Screenshot 2024-01-29 at 09.06.34.png > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov, Screenshot > 2024-01-29 at 09.06.34.png > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g.,
[jira] [Updated] (SPARK-46893) Remove inline scripts from UI descriptions
[ https://issues.apache.org/jira/browse/SPARK-46893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-46893: -- Summary: Remove inline scripts from UI descriptions (was: Sanitize UI descriptions from inline scripts) > Remove inline scripts from UI descriptions > -- > > Key: SPARK-46893 > URL: https://issues.apache.org/jira/browse/SPARK-46893 > Project: Spark > Issue Type: Bug > Components: UI, Web UI >Affects Versions: 3.4.1 >Reporter: Willi Raschkowski >Priority: Major > Attachments: Screen Recording 2024-01-28 at 17.51.47.mov > > > Users can inject inline scripts (e.g. {{onclick}} or {{onmouseover}} > handlers) in the UI job and stage descriptions. > The UI already has precaution to treat, e.g.,
[jira] [Assigned] (SPARK-46902) Fix Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-46902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46902: -- Assignee: (was: Apache Spark) > Fix Spark History Server UI > > > Key: SPARK-46902 > URL: https://issues.apache.org/jira/browse/SPARK-46902 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46902) Fix Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-46902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46902: -- Assignee: Apache Spark > Fix Spark History Server UI > > > Key: SPARK-46902 > URL: https://issues.apache.org/jira/browse/SPARK-46902 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46903) Support Spark History Server Log UI
[ https://issues.apache.org/jira/browse/SPARK-46903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-46903:
-------------------------------------
Assignee: Dongjoon Hyun

> Support Spark History Server Log UI
>
> Key: SPARK-46903
> URL: https://issues.apache.org/jira/browse/SPARK-46903
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-46903) Support Spark History Server Log UI
[ https://issues.apache.org/jira/browse/SPARK-46903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46903:
-----------------------------------
Labels: pull-request-available (was: )

> Support Spark History Server Log UI
>
> Key: SPARK-46903
> URL: https://issues.apache.org/jira/browse/SPARK-46903
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-46903) Support Spark History Server Log UI
Dongjoon Hyun created SPARK-46903:
----------------------------------

Summary: Support Spark History Server Log UI
Key: SPARK-46903
URL: https://issues.apache.org/jira/browse/SPARK-46903
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-46902) Fix Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-46902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46902:
-----------------------------------
Labels: pull-request-available (was: )

> Fix Spark History Server UI
>
> Key: SPARK-46902
> URL: https://issues.apache.org/jira/browse/SPARK-46902
> Project: Spark
> Issue Type: Sub-task
> Components: Web UI
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-46902) Fix Spark History Server UI
Kent Yao created SPARK-46902:
-----------------------------

Summary: Fix Spark History Server UI
Key: SPARK-46902
URL: https://issues.apache.org/jira/browse/SPARK-46902
Project: Spark
Issue Type: Sub-task
Components: Web UI
Affects Versions: 4.0.0
Reporter: Kent Yao