[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

Apache Spark (Jira) Mon, 07 Sep 2020 01:46:10 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Apache Spark reassigned SPARK-32810:
------------------------------------

    Assignee: Apache Spark

> CSV/JSON data sources should avoid globbing paths when inferring schema
> -----------------------------------------------------------------------
>
>                 Key: SPARK-32810
>                 URL: https://issues.apache.org/jira/browse/SPARK-32810
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Apache Spark
>            Priority: Major
>
> When the user does not specify a schema while reading a CSV table, the CSV 
> file format and data source need to infer the schema, and they do so by 
> creating a base DataSource relation. This exposes a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but 
> *DataSource.paths* expects file paths as glob patterns.
> The call stack below (read bottom to top) shows an example:
> {code:java}
> ^
> |         DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> |         DataSource.apply                      ^
> |       CSVDataSource.inferSchema               |
> |     CSVFileFormat.inferSchema                 |
> |   ...                                         |
> |   DataSource.resolveRelation          globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply                            ^
> | DataFrameReader.load                          |
> |                                       input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.
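
The double-globbing effect described above can be reproduced outside Spark. The following is a minimal sketch in Python using only the standard library's `fnmatch` and `glob.escape`; it is an analogy, not Spark's code. The file names and the helper `double_glob_demo` are hypothetical, and Python escapes a literal `[` as `[[]` rather than with the backslash shown in the issue.

```python
import fnmatch
import glob


def double_glob_demo():
    """Show how resolving an already-resolved path as a glob picks the wrong file."""
    # Hypothetical directory listing: one file whose name literally contains
    # brackets, and one file that the pattern "[abc].csv" would match.
    names = ["[abc].csv", "a.csv"]

    # The user escapes the brackets so the pattern means the literal name.
    # glob.escape("[abc].csv") produces "[[]abc].csv".
    user_input = glob.escape("[abc].csv")

    # First resolution (analogous to the outer resolve step): correct result.
    first = fnmatch.filter(names, user_input)

    # Second resolution (analogous to the inner step during schema inference):
    # the resolved name is wrongly treated as a pattern again.
    second = [m for path in first for m in fnmatch.filter(names, path)]

    # Treating the resolved names verbatim avoids the problem.
    verbatim = [n for n in names if n in first]

    return first, second, verbatim


if __name__ == "__main__":
    first, second, verbatim = double_glob_demo()
    print("first glob:   ", first)
    print("second glob:  ", second)
    print("verbatim path:", verbatim)
```

In this sketch the second resolution returns `a.csv` instead of the file the user asked for, which is the kind of mismatch the issue describes for schema inference.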



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
