[jira] [Updated] (SPARK-49010) Add unit tests for XML case sensitivity
[ https://issues.apache.org/jira/browse/SPARK-49010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-49010: - Description: Currently, XML respects the case sensitivity SQLConf (default to false) in the schema inference but we lack unit tests to verify the behavior. This PR adds more unit tests to it. (was: Currently, XML respects the case sensitivity SQLConf (default to false) in the schema inference but we lack unit tests to verify the behavior. This PR adds more unit tests to this,) > Add unit tests for XML case sensitivity > --- > > Key: SPARK-49010 > URL: https://issues.apache.org/jira/browse/SPARK-49010 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Shujing Yang >Priority: Major > > Currently, XML respects the case sensitivity SQLConf (default to false) in > the schema inference but we lack unit tests to verify the behavior. This PR > adds more unit tests to it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49010) Add unit tests for XML case sensitivity
Shujing Yang created SPARK-49010: Summary: Add unit tests for XML case sensitivity Key: SPARK-49010 URL: https://issues.apache.org/jira/browse/SPARK-49010 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang Currently, XML respects the case sensitivity SQLConf (default to false) in the schema inference but we lack unit tests to verify the behavior. This PR adds more unit tests to this,
[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema
[ https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-48100: - Description: Previously, the XML parser could not effectively skip nested structure data fields when they were not selected in the schema. For instance, in the example below, `df.select("struct2").collect()` returned `Seq(null)` because `struct1` was not effectively skipped. This PR fixes this issue. {code:java} 1 2 {code} was: Previously, the XML parser could not skip nested structure data fields when they were not selected in the schema. For instance, in the example below, `df.select("struct2").collect()` returned `Seq(null)` because `struct1` was not effectively skipped. This PR fixes this issue. {code:java} 1 2 {code} > [SQL][XML] Fix issues in skipping nested structure fields not selected in > schema > > > Key: SPARK-48100 > URL: https://issues.apache.org/jira/browse/SPARK-48100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Shujing Yang >Priority: Major > Labels: pull-request-available > > Previously, the XML parser could not effectively skip nested structure data > fields when they were not selected in the schema. For instance, in the > example below, `df.select("struct2").collect()` returned `Seq(null)` because > `struct1` was not effectively skipped. This PR fixes this issue. > {code:java} > > > 1 > > > 2 > > {code}
[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema
[ https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-48100: - Summary: [SQL][XML] Fix issues in skipping nested structure fields not selected in schema (was: [SQL][XML] Fix projection issue when there's a nested struct) > [SQL][XML] Fix issues in skipping nested structure fields not selected in > schema > > > Key: SPARK-48100 > URL: https://issues.apache.org/jira/browse/SPARK-48100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Shujing Yang >Priority: Major > > Previously, the XML parser could not skip nested structure data fields when they > were not selected in the schema. For instance, in the example below, > `df.select("struct2").collect()` returned `Seq(null)` because `struct1` was not > effectively skipped. This PR fixes this issue. > {code:java} > > > 1 > > > 2 > > {code}
[jira] [Created] (SPARK-48100) [SQL][XML] Fix projection issue when there's a nested struct
Shujing Yang created SPARK-48100: Summary: [SQL][XML] Fix projection issue when there's a nested struct Key: SPARK-48100 URL: https://issues.apache.org/jira/browse/SPARK-48100 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang Previously, the XML parser could not skip nested structure data fields when they were not selected in the schema. For instance, in the example below, `df.select("struct2").collect()` returned `Seq(null)` because `struct1` was not effectively skipped. This PR fixes this issue. {code:java} 1 2 {code}
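The skipping behavior described in SPARK-48100 can be illustrated with a minimal, standalone sketch. This is not Spark's parser: it uses Python's standard-library `xml.etree.ElementTree.iterparse` as a stand-in for a streaming XML reader, and the sample document, the `ROW` row tag, and the helper names (`skip_subtree`, `read_struct`, `read_rows`) are all assumptions made for illustration, since the ticket's own XML example lost its tags in this archive. The point is that an unselected nested struct has to be consumed up to and including its matching end tag, so that the reader is positioned correctly for the next selected field.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: struct1 is nested one level deeper than struct2.
SAMPLE = """
<ROWS>
  <ROW>
    <struct1><innerStruct><field1>1</field1></innerStruct></struct1>
    <struct2><field2>2</field2></struct2>
  </ROW>
</ROWS>
"""

def skip_subtree(events):
    """Consume events through the end tag matching an already-read start tag."""
    depth = 1
    for event, _ in events:
        depth += 1 if event == "start" else -1
        if depth == 0:
            return

def read_struct(events):
    """Read the children of an already-opened struct into a dict of text values."""
    fields = {}
    for event, elem in events:
        if event == "end":  # end tag of the struct itself
            return fields
        skip_subtree(events)  # consume the child; its text is complete afterwards
        fields[elem.tag] = (elem.text or "").strip()
    return fields

def read_rows(xml_text, row_tag, selected):
    """Return one dict per row containing only the selected top-level fields."""
    events = ET.iterparse(io.StringIO(xml_text), events=("start", "end"))
    rows = []
    for event, elem in events:
        if event == "start" and elem.tag == row_tag:
            row = {name: None for name in selected}
            for ev, child in events:
                if ev == "end":  # end tag of the row
                    break
                if child.tag in selected:
                    row[child.tag] = read_struct(events)
                else:
                    skip_subtree(events)  # unselected field: skip all nesting levels
            rows.append(row)
    return rows

print(read_rows(SAMPLE, "ROW", ["struct2"]))
```

If `skip_subtree` stopped at the first end tag instead of tracking depth, the reader would resume inside `struct1` and never reach `struct2`, which is the kind of failure that would leave the selected field null.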
[jira] [Created] (SPARK-47309) [XML] Add schema inference unit tests
Shujing Yang created SPARK-47309: Summary: [XML] Add schema inference unit tests Key: SPARK-47309 URL: https://issues.apache.org/jira/browse/SPARK-47309 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang As titled. This change also fixes two schema inference issues: 1) when there is an empty tag, and 2) when merging schemas for NullType.
[jira] [Created] (SPARK-46848) XML: Add support to partial results
Shujing Yang created SPARK-46848: Summary: XML: Add support to partial results Key: SPARK-46848 URL: https://issues.apache.org/jira/browse/SPARK-46848 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang Add support for partial results in XML bad record handling.
[jira] [Created] (SPARK-46382) XML: Capture values interspersed between elements
Shujing Yang created SPARK-46382: Summary: XML: Capture values interspersed between elements Key: SPARK-46382 URL: https://issues.apache.org/jira/browse/SPARK-46382 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang In XML, elements typically consist of a name and a value, with the value enclosed between the opening and closing tags. However, XML also allows arbitrary values to be interspersed between these elements. To address this, we provide an option named `valueTags`, which is enabled by default, to capture these values. Consider the following example: ``` 1 value1 value2 2 value3 ``` In this example, the named elements have their respective values enclosed within tags, and the arbitrary values value1, value2, and value3 are interspersed between the elements. Note that a single element can contain multiple such values (for example, value2 and value3 occur in the same element). We should parse the values between tags into the valueTags field. If there are multiple occurrences of value tags, the value tag field will be converted to an array type.
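The capture rule described in SPARK-46382 can be sketched outside Spark with Python's standard library. This is only an illustration under stated assumptions: the sample document and its element names (`ROW`, `a`, `b`, `c`) are hypothetical, because the ticket's example lost its tags in this archive, and the `_VALUE` field name and the `to_record` helper are placeholders rather than Spark's actual API. The sketch collects text that sits directly inside an element (its `text` and the `tail` of each child) and stores it as a single value, or as a list when there is more than one occurrence.

```python
import xml.etree.ElementTree as ET

VALUE_FIELD = "_VALUE"  # hypothetical name for the field holding interspersed values

# Hypothetical sample with values interspersed between elements.
SAMPLE = """
<ROW>
  <a>1</a>
  value1
  <b>
    value2
    <c>2</c>
    value3
  </b>
</ROW>
"""

def to_record(elem):
    """Convert an element with children into a dict, capturing interspersed text.

    Simplification: repeated child tags overwrite each other, and attributes
    are ignored.
    """
    record = {}
    values = []
    if elem.text and elem.text.strip():
        values.append(elem.text.strip())  # text before the first child
    for child in elem:
        if len(child) == 0:
            record[child.tag] = (child.text or "").strip()  # leaf element
        else:
            record[child.tag] = to_record(child)  # nested element
        if child.tail and child.tail.strip():
            values.append(child.tail.strip())  # text after this child
    if len(values) == 1:
        record[VALUE_FIELD] = values[0]
    elif len(values) > 1:
        record[VALUE_FIELD] = values  # multiple occurrences become an array
    return record

print(to_record(ET.fromstring(SAMPLE)))
```

In this sample, `value1` is captured on the row as a single value, while `value2` and `value3` are captured together as a list on `b`, mirroring the single-value versus array distinction in the ticket.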
[jira] [Created] (SPARK-46248) Support ignoreCorruptFiles and ignoreMissingFiles options in XML
Shujing Yang created SPARK-46248: Summary: Support ignoreCorruptFiles and ignoreMissingFiles options in XML Key: SPARK-46248 URL: https://issues.apache.org/jira/browse/SPARK-46248 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang This PR corrects the handling of corrupt or missing multiline XML files by respecting user-specified options.
[jira] [Created] (SPARK-45928) Fix schema merging for nested structures
Shujing Yang created SPARK-45928: Summary: Fix schema merging for nested structures Key: SPARK-45928 URL: https://issues.apache.org/jira/browse/SPARK-45928 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0, 3.5.1 Reporter: Shujing Yang Previously, when Parquet merged nested structures, it did not respect the SQLConf case-sensitivity configuration, which led to an AnalysisException. This PR fixes this issue.
[jira] [Updated] (SPARK-45912) Enhancement of XSDToSchema API: Change to HDFS API for cloud storage accessibility
[ https://issues.apache.org/jira/browse/SPARK-45912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-45912: - Summary: Enhancement of XSDToSchema API: Change to HDFS API for cloud storage accessibility (was: Enhancement of XSDToSchema API: Transit to HDFS API for cloud storage accessibility) > Enhancement of XSDToSchema API: Change to HDFS API for cloud storage > accessibility > --- > > Key: SPARK-45912 > URL: https://issues.apache.org/jira/browse/SPARK-45912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Shujing Yang >Priority: Major > > Previously, XSDToSchema used `java.nio.path`, which limited file reading to > local file systems. By changing this to an HDFS-compatible API, we now enable > the XSDToSchema function to access files in cloud storage.
[jira] [Created] (SPARK-45912) Enhancement of XSDToSchema API: Transit to HDFS API for cloud storage accessibility
Shujing Yang created SPARK-45912: Summary: Enhancement of XSDToSchema API: Transit to HDFS API for cloud storage accessibility Key: SPARK-45912 URL: https://issues.apache.org/jira/browse/SPARK-45912 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Shujing Yang Previously, XSDToSchema used `java.nio.path`, which limited file reading to local file systems. By changing this to an HDFS-compatible API, we now enable the XSDToSchema function to access files in cloud storage.
[jira] [Created] (SPARK-45844) Implement case insensitivity for XML
Shujing Yang created SPARK-45844: Summary: Implement case insensitivity for XML Key: SPARK-45844 URL: https://issues.apache.org/jira/browse/SPARK-45844 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Shujing Yang Currently, XML does not follow the `SQLConf` case-sensitivity setting, which is inconsistent with other file formats. This PR implements case-insensitive behavior for schema inference and file reads.
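As a rough illustration of what case-insensitive schema inference means, the sketch below merges field names across records with a `case_sensitive` flag that plays the role of the `spark.sql.caseSensitive` setting (which defaults to false). It is a standalone Python sketch, not Spark's implementation: the `infer_field_names` helper is hypothetical, and the rule that the first-seen spelling of a name wins is an assumption made here for illustration.

```python
def infer_field_names(records, case_sensitive=False):
    """Return the union of field names across records, in first-seen order.

    When case_sensitive is False, names that differ only by case are treated
    as the same field and the first-seen spelling is kept (an assumption of
    this sketch).
    """
    seen = {}
    for record in records:
        for name in record:
            key = name if case_sensitive else name.lower()
            seen.setdefault(key, name)  # keep the first spelling for each key
    return list(seen.values())

# Two hypothetical records whose field names differ only by case.
records = [{"Price": 1}, {"price": 2, "id": 3}]

print(infer_field_names(records))                       # one field for Price/price
print(infer_field_names(records, case_sensitive=True))  # Price and price stay separate
```

Unit tests of the kind requested in SPARK-49010 would exercise both settings and check that the inferred schema either merges or separates such fields accordingly.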
[jira] [Resolved] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.
[ https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang resolved SPARK-45653. -- Resolution: Not A Problem > Refractor XMLSuite to allow other test suites to easily extend and override. > > > Key: SPARK-45653 > URL: https://issues.apache.org/jira/browse/SPARK-45653 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Shujing Yang >Priority: Major > Labels: pull-request-available > > Refactor XmlSuite to integrate dataframe readers, allowing other test suites > to easily extend and override.
[jira] [Updated] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.
[ https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shujing Yang updated SPARK-45653: - Summary: Refractor XMLSuite to allow other test suites to easily extend and override. (was: Refractor XMLSuite) > Refractor XMLSuite to allow other test suites to easily extend and override. > > > Key: SPARK-45653 > URL: https://issues.apache.org/jira/browse/SPARK-45653 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Shujing Yang >Priority: Major > > Refactor XmlSuite to integrate dataframe readers, allowing other test suites > to easily extend and override.
[jira] [Created] (SPARK-45653) Refractor XMLSuite
Shujing Yang created SPARK-45653: Summary: Refractor XMLSuite Key: SPARK-45653 URL: https://issues.apache.org/jira/browse/SPARK-45653 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Shujing Yang Refactor XmlSuite to integrate dataframe readers, allowing other test suites to easily extend and override.