[jira] [Updated] (SPARK-49010) Add unit tests for XML case sensitivity

2024-07-25 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-49010:
-
Description: Currently, XML respects the case sensitivity SQLConf (default 
to false) in the schema inference but we lack unit tests to verify the 
behavior. This PR adds more unit tests to it.  (was: Currently, XML respects 
the case sensitivity SQLConf (default to false) in the schema inference but we 
lack unit tests to verify the behavior. This PR adds more unit tests to this,)

> Add unit tests for XML case sensitivity
> ---
>
> Key: SPARK-49010
> URL: https://issues.apache.org/jira/browse/SPARK-49010
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Shujing Yang
>Priority: Major
>
> Currently, XML respects the case sensitivity SQLConf (default to false) in 
> the schema inference but we lack unit tests to verify the behavior. This PR 
> adds more unit tests to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49010) Add unit tests for XML case sensitivity

2024-07-25 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-49010:


 Summary: Add unit tests for XML case sensitivity
 Key: SPARK-49010
 URL: https://issues.apache.org/jira/browse/SPARK-49010
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


Currently, XML respects the case sensitivity SQLConf (default to false) in the 
schema inference but we lack unit tests to verify the behavior. This PR adds 
more unit tests to this,






[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema

2024-05-02 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-48100:
-
Description: 
Previously, the XML parser couldn't skip nested structure data fields effectively 
when they were not selected in the schema. For instance, in the example below, 
`df.select("struct2").collect()` returned `Seq(null)` because `struct1` wasn't 
effectively skipped. This PR fixes this issue.
{code:java}

  
    1
  
  
    2
  
{code}
 

  was:
Previously, the XML parser can't skip nested structure data fields when they 
were not selected in the schema. For instance, in the below example, 
`df.select("struct2").collect()` returns `Seq(null)` as `struct1` wasn't 
effectively skipped. This PR fixes this issue.
{code:java}

  
    1
  
  
    2
  
{code}
 


> [SQL][XML] Fix issues in skipping nested structure fields not selected in 
> schema
> 
>
> Key: SPARK-48100
> URL: https://issues.apache.org/jira/browse/SPARK-48100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Shujing Yang
>Priority: Major
>  Labels: pull-request-available
>
> Previously, the XML parser couldn't skip nested structure data fields 
> effectively when they were not selected in the schema. For instance, in the 
> example below, `df.select("struct2").collect()` returned `Seq(null)` because 
> `struct1` wasn't effectively skipped. This PR fixes this issue.
> {code:java}
> 
>   
>     1
>   
>   
>     2
>   
> {code}
>  






[jira] [Updated] (SPARK-48100) [SQL][XML] Fix issues in skipping nested structure fields not selected in schema

2024-05-02 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-48100:
-
Summary: [SQL][XML] Fix issues in skipping nested structure fields not 
selected in schema  (was: [SQL][XML] Fix projection issue when there's a nested 
struct)

> [SQL][XML] Fix issues in skipping nested structure fields not selected in 
> schema
> 
>
> Key: SPARK-48100
> URL: https://issues.apache.org/jira/browse/SPARK-48100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Shujing Yang
>Priority: Major
>
> Previously, the XML parser couldn't skip nested structure data fields when they 
> were not selected in the schema. For instance, in the example below, 
> `df.select("struct2").collect()` returned `Seq(null)` because `struct1` wasn't 
> effectively skipped. This PR fixes this issue.
> {code:java}
> 
>   
>     1
>   
>   
>     2
>   
> {code}
>  






[jira] [Created] (SPARK-48100) [SQL][XML] Fix projection issue when there's a nested struct

2024-05-02 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-48100:


 Summary: [SQL][XML] Fix projection issue when there's a nested 
struct
 Key: SPARK-48100
 URL: https://issues.apache.org/jira/browse/SPARK-48100
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


Previously, the XML parser couldn't skip nested structure data fields when they 
were not selected in the schema. For instance, in the example below, 
`df.select("struct2").collect()` returned `Seq(null)` because `struct1` wasn't 
effectively skipped. This PR fixes this issue.
{code:java}

  
    1
  
  
    2
  
{code}
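
A minimal sketch of the skipping problem, in plain Python rather than Spark's parser. It assumes a streaming reader that must consume an unselected field's whole subtree before moving on. Apart from `struct1` and `struct2`, every tag name in the sample record (`ROW`, `innerStruct`, `field1`, `field2`) is hypothetical, as is the helper `read_selected`.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical record: only struct1/struct2 come from the issue text.
XML = """
<ROW>
  <struct1>
    <innerStruct><field1>1</field1></innerStruct>
  </struct1>
  <struct2>
    <field2>2</field2>
  </struct2>
</ROW>
"""


def read_selected(xml_text, selected):
    """Return {field: {child: text}} for the top-level fields in `selected`.

    An unselected field is skipped as a whole subtree by tracking nesting
    depth, so its nested children are never mistaken for sibling fields.
    """
    result = {}
    depth = 0          # 1 = row element, 2 = top-level field, 3 = its children
    skip_depth = None  # depth at which the field being skipped was opened
    current = None     # selected top-level field currently being read
    events = ET.iterparse(io.StringIO(xml_text), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            depth += 1
            if skip_depth is None and depth == 2:
                if elem.tag in selected:
                    current = elem.tag
                    result[current] = {}
                else:
                    skip_depth = depth
        else:
            if skip_depth is not None:
                # Stop skipping only when the skipped field itself closes.
                if depth == skip_depth:
                    skip_depth = None
            elif depth == 3 and current is not None:
                result[current][elem.tag] = (elem.text or "").strip()
            elif depth == 2:
                current = None
            depth -= 1
    return result


print(read_selected(XML, {"struct2"}))
```

If the skip ended at the first closing tag instead of at the matching depth, the reader would resume inside `struct1` and lose track of `struct2`, which is the kind of failure the issue describes.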
 






[jira] [Created] (SPARK-47309) [XML] Add schema inference unit tests

2024-03-06 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-47309:


 Summary: [XML] Add schema inference unit tests
 Key: SPARK-47309
 URL: https://issues.apache.org/jira/browse/SPARK-47309
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


As titled.

It also fixes schema inference issues that occur:

1) when there's an empty tag

2) when merging schemas that involve NullType
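
The two cases can be illustrated with a simplified type-merging sketch in plain Python. The rules here are assumptions for illustration, not Spark's actual inference code: an empty tag infers a null type, a null type merged with any other type yields the other type, and incompatible types fall back to string.

```python
from functools import reduce

NULL, LONG, DOUBLE, STRING = "null", "long", "double", "string"


def infer_type(text):
    """Infer a type name from the text of one tag (empty tag -> null)."""
    if text is None or text.strip() == "":
        return NULL
    try:
        int(text)
        return LONG
    except ValueError:
        pass
    try:
        float(text)
        return DOUBLE
    except ValueError:
        return STRING


def merge_types(left, right):
    """Merge two inferred types; null never overrides a concrete type."""
    if left == right:
        return left
    if left == NULL:
        return right
    if right == NULL:
        return left
    if {left, right} == {LONG, DOUBLE}:
        return DOUBLE
    return STRING


# One field observed across three records: empty tag, integer, decimal.
observed = [infer_type(t) for t in ["", "3", "4.5"]]
print(reduce(merge_types, observed))
```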






[jira] [Created] (SPARK-46848) XML: Add support to partial results

2024-01-24 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-46848:


 Summary: XML: Add support to partial results
 Key: SPARK-46848
 URL: https://issues.apache.org/jira/browse/SPARK-46848
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


Add support for partial results in XML bad-record handling.
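
As a rough sketch of what "partial results" can mean, the plain-Python example below keeps every field that converts cleanly and nulls only the field that fails, instead of nulling the whole row. The helper `parse_row`, the sample record, and the column name `_corrupt_record` used for the raw text are assumptions for illustration, not Spark's implementation.

```python
import xml.etree.ElementTree as ET

CORRUPT_COLUMN = "_corrupt_record"  # assumed name for the raw-record column


def parse_row(raw_xml, schema):
    """Parse one record, keeping every field that converts cleanly."""
    row = {}
    malformed = False
    root = ET.fromstring(raw_xml)
    for name, convert in schema.items():
        child = root.find(name)
        if child is None or child.text is None:
            row[name] = None  # absent or empty field: null, not an error
            continue
        try:
            row[name] = convert(child.text)
        except ValueError:
            row[name] = None  # only the bad field is lost
            malformed = True
    if malformed:
        row[CORRUPT_COLUMN] = raw_xml  # keep the raw text for inspection
    return row


schema = {"id": int, "price": float}
print(parse_row("<ROW><id>1</id><price>abc</price></ROW>", schema))
```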






[jira] [Created] (SPARK-46382) XML: Capture values interspersed between elements

2023-12-12 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-46382:


 Summary: XML: Capture values interspersed between elements
 Key: SPARK-46382
 URL: https://issues.apache.org/jira/browse/SPARK-46382
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


In XML, elements typically consist of a name and a value, with the value 
enclosed between the opening and closing tags. XML also allows arbitrary values 
to be interspersed between these elements. To capture these values, we provide 
an option named `valueTags`, which is enabled by default. Consider the following 
example:

```


    1
  value1
  
    value2
    2
    value3
  


```
In this example, the three tagged elements are named elements with their 
respective values enclosed within tags. The arbitrary values value1, value2, 
and value3 are interspersed between the elements. Note that a single element 
can contain multiple such values (here, value2 and value3 occur inside the 
same element).

 

We should parse the values between tags into the valueTags field. If there are 
multiple occurrences of value tags, the value tag field will be converted to an 
array type.
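
The behavior described above can be sketched with Python's standard `xml.etree.ElementTree`, where interspersed values appear as an element's `text` and its children's `tail`. This is an illustration, not Spark's parser: the tag names `ROW`, `a`, `b`, `c`, the helper `to_row`, and the field name `_VALUE` are all assumptions.

```python
import xml.etree.ElementTree as ET

VALUE_KEY = "_VALUE"  # assumed name of the field that receives loose values

# Hypothetical record; the tag names are invented for this sketch.
XML = """
<ROW>
    <a>1</a>
  value1
  <b>
    value2
    <c>2</c>
    value3
  </b>
</ROW>
"""


def to_row(elem):
    """Convert an element to a dict, capturing values between child tags."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    # Repeated child tags would overwrite each other; fine for this sketch.
    row = {child.tag: to_row(child) for child in children}
    # Loose text sits before the first child and after each child.
    pieces = [elem.text] + [child.tail for child in children]
    values = [p.strip() for p in pieces if p and p.strip()]
    if len(values) == 1:
        row[VALUE_KEY] = values[0]
    elif values:
        row[VALUE_KEY] = values  # several occurrences become an array
    return row


row = to_row(ET.fromstring(XML))
print(row)
```

In this sketch the outer element ends up with a single string under `_VALUE`, while the element holding value2 and value3 gets a list.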






[jira] [Created] (SPARK-46248) Support ignoreCorruptFiles and ignoreMissingFiles options in XML

2023-12-04 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-46248:


 Summary: Support ignoreCorruptFiles and ignoreMissingFiles options 
in XML
 Key: SPARK-46248
 URL: https://issues.apache.org/jira/browse/SPARK-46248
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


This PR corrects the handling of corrupt or missing multiline XML files by 
respecting the user-specified `ignoreCorruptFiles` and `ignoreMissingFiles` 
options.
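
A simplified sketch of the intended semantics, in plain Python rather than Spark: each flag decides whether one kind of failure is skipped or re-raised. The function `read_all`, its parameter names, and the `demo` helper are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_all(paths, ignore_corrupt_files=False, ignore_missing_files=False):
    """Parse each XML file, skipping failures only when the flag allows it."""
    roots = []
    for path in paths:
        try:
            roots.append(ET.parse(path).getroot())
        except FileNotFoundError:
            if not ignore_missing_files:
                raise
        except ET.ParseError:
            if not ignore_corrupt_files:
                raise
    return roots


def demo():
    """Read one good, one truncated, and one missing file with both flags on."""
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.xml")
        bad = os.path.join(tmp, "bad.xml")
        missing = os.path.join(tmp, "missing.xml")
        with open(good, "w") as f:
            f.write("<ROW><a>1</a></ROW>")
        with open(bad, "w") as f:
            f.write("<ROW><a>1</a>")  # truncated: closing tag is absent
        roots = read_all(
            [good, bad, missing],
            ignore_corrupt_files=True,
            ignore_missing_files=True,
        )
        return [root.tag for root in roots]


print(demo())
```

With both flags left at `False`, the same call would raise on the first truncated or missing file instead of skipping it.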






[jira] [Created] (SPARK-45928) Fix schema merging for nested structures

2023-11-14 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-45928:


 Summary: Fix schema merging for nested structures
 Key: SPARK-45928
 URL: https://issues.apache.org/jira/browse/SPARK-45928
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0, 3.5.1
Reporter: Shujing Yang


Previously, when Parquet merged nested structures, it didn't respect the 
SQLConf case-sensitivity configuration, which led to an `AnalysisException`. 
This PR fixes this issue.
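
To show why the nested case matters, here is a small plain-Python sketch of a recursive struct merge that applies the same case rule at every level. Schemas are modeled as nested dicts; the function `merge_structs` and the rule that the first-seen spelling is kept are assumptions for illustration, not Spark's code.

```python
def merge_structs(left, right, case_sensitive=False):
    """Merge two struct schemas (nested dicts), honoring case sensitivity."""

    def key(name):
        return name if case_sensitive else name.lower()

    merged = dict(left)
    index = {key(name): name for name in merged}
    for name, right_type in right.items():
        existing = index.get(key(name))
        if existing is None:
            merged[name] = right_type
            index[key(name)] = name
            continue
        left_type = merged[existing]
        if isinstance(left_type, dict) and isinstance(right_type, dict):
            # Recurse with the same flag so nested fields follow the same rule.
            merged[existing] = merge_structs(left_type, right_type, case_sensitive)
        elif left_type != right_type:
            raise ValueError(f"conflicting types for field {name!r}")
    return merged


a = {"s": {"a": "int"}}
b = {"S": {"A": "int", "b": "string"}}
print(merge_structs(a, b))
print(merge_structs(a, b, case_sensitive=True))
```

Dropping the flag in the recursive call would treat `a` and `A` as distinct fields inside an otherwise merged struct, which is the kind of inconsistency the issue reports.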






[jira] [Updated] (SPARK-45912) Enhancement of XSDToSchema API: Change to HDFS API for cloud storage accessibility

2023-11-13 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-45912:
-
Summary: Enhancement  of XSDToSchema API: Change to HDFS API for cloud 
storage accessibility  (was: Enhancement  of XSDToSchema API: Transit to HDFS 
API for cloud storage accessibility)

> Enhancement  of XSDToSchema API: Change to HDFS API for cloud storage 
> accessibility
> ---
>
> Key: SPARK-45912
> URL: https://issues.apache.org/jira/browse/SPARK-45912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Shujing Yang
>Priority: Major
>
> Previously, XSDToSchema used `java.nio.file.Path`, which limited file reading 
> to local file systems only. By changing this to an HDFS-compatible API, we now 
> enable the XSDToSchema function to access files in cloud storage.






[jira] [Created] (SPARK-45912) Enhancement of XSDToSchema API: Transit to HDFS API for cloud storage accessibility

2023-11-13 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-45912:


 Summary: Enhancement  of XSDToSchema API: Transit to HDFS API for 
cloud storage accessibility
 Key: SPARK-45912
 URL: https://issues.apache.org/jira/browse/SPARK-45912
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Shujing Yang


Previously, XSDToSchema used `java.nio.file.Path`, which limited file reading 
to local file systems only. By changing this to an HDFS-compatible API, we now 
enable the XSDToSchema function to access files in cloud storage.






[jira] [Created] (SPARK-45844) Implement case insensitivity for XML

2023-11-08 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-45844:


 Summary: Implement case insensitivity for XML
 Key: SPARK-45844
 URL: https://issues.apache.org/jira/browse/SPARK-45844
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Shujing Yang


Currently, XML doesn't follow the `SQLConf` case-sensitivity setting, which is 
inconsistent with other file formats. This PR implements the case-insensitive 
behavior for schema inference and file reads.
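
The two places where case handling applies, inference and reads, can be sketched in plain Python as follows. The helpers `infer_field_names` and `read_field`, the sample records, and the choice that the first-seen spelling wins in case-insensitive mode are assumptions for illustration, not Spark's behavior.

```python
import xml.etree.ElementTree as ET


def infer_field_names(records, case_sensitive=False):
    """Collect top-level field names across records.

    In case-insensitive mode, names that differ only by case collapse into
    one field; this sketch keeps the first spelling it sees.
    """
    seen = {}
    for record in records:
        for child in ET.fromstring(record):
            key = child.tag if case_sensitive else child.tag.lower()
            seen.setdefault(key, child.tag)
    return list(seen.values())


def read_field(record, field, case_sensitive=False):
    """Return the text of the first child matching `field`, or None."""
    for child in ET.fromstring(record):
        if case_sensitive:
            matches = child.tag == field
        else:
            matches = child.tag.lower() == field.lower()
        if matches:
            return child.text
    return None


records = [
    "<ROW><Name>a</Name></ROW>",
    "<ROW><name>b</name><age>3</age></ROW>",
]
print(infer_field_names(records))
print(infer_field_names(records, case_sensitive=True))
print(read_field(records[1], "NAME"))
```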






[jira] [Resolved] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.

2023-10-24 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang resolved SPARK-45653.
--
Resolution: Not A Problem

> Refractor XMLSuite to allow other test suites to easily extend and override.
> 
>
> Key: SPARK-45653
> URL: https://issues.apache.org/jira/browse/SPARK-45653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Shujing Yang
>Priority: Major
>  Labels: pull-request-available
>
> Refactor XmlSuite to integrate dataframe readers, allowing other test suites 
> to easily extend and override.






[jira] [Updated] (SPARK-45653) Refractor XMLSuite to allow other test suites to easily extend and override.

2023-10-24 Thread Shujing Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shujing Yang updated SPARK-45653:
-
Summary: Refractor XMLSuite to allow other test suites to easily extend and 
override.  (was: Refractor XMLSuite)

> Refractor XMLSuite to allow other test suites to easily extend and override.
> 
>
> Key: SPARK-45653
> URL: https://issues.apache.org/jira/browse/SPARK-45653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Shujing Yang
>Priority: Major
>
> Refactor XmlSuite to integrate dataframe readers, allowing other test suites 
> to easily extend and override.






[jira] [Created] (SPARK-45653) Refractor XMLSuite

2023-10-24 Thread Shujing Yang (Jira)
Shujing Yang created SPARK-45653:


 Summary: Refractor XMLSuite
 Key: SPARK-45653
 URL: https://issues.apache.org/jira/browse/SPARK-45653
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Shujing Yang


Refactor XmlSuite to integrate dataframe readers, allowing other test suites to 
easily extend and override.


