Maxim Gekk created SPARK-32810:
----------------------------------

             Summary: CSV/JSON data sources should avoid globbing paths when inferring schema
                 Key: SPARK-32810
                 URL: https://issues.apache.org/jira/browse/SPARK-32810
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Maxim Gekk

The problem is that when the user doesn't specify a schema when reading a CSV table, the CSV file format and data source need to infer it. They do so by creating a base DataSource relation, and that is where a mismatch arises: *FileFormat.inferSchema* expects actual file paths without glob patterns, but *DataSource.paths* expects file paths as glob patterns. As a result, a path that has already been produced by globbing is globbed a second time. An example is demonstrated below (call stack on the left, innermost call at the top; the path as it flows upward on the right):

{code:java}
  ^
  | DataSource.resolveRelation      tries to glob again (incorrectly) on glob pattern """[abc].csv"""
  | DataSource.apply                    ^
  | CSVDataSource.inferSchema           |
  | CSVFileFormat.inferSchema           |
  | ...                                 |
  | DataSource.resolveRelation      globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
  | DataSource.apply                    ^
  | DataFrameReader.load                |
  |                                 input """\[abc\].csv"""
{code}

The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source.
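
The double-globbing effect can be reproduced outside Spark. The following is a minimal sketch using Python's standard `glob` module as a stand-in for the path globbing Spark performs; it is an analogy, not Spark's code path. The helper names `glob_names` and `demo` are hypothetical, and Python escapes a literal bracket as `[[]` rather than with the backslashes used in the report.

```python
import glob
import os
import tempfile


def glob_names(directory, pattern):
    """Return the sorted base names in `directory` that match glob `pattern`."""
    # Escape the directory part so only `pattern` is interpreted as a glob.
    matches = glob.glob(os.path.join(glob.escape(directory), pattern))
    return sorted(os.path.basename(m) for m in matches)


def demo():
    with tempfile.TemporaryDirectory() as d:
        # One file whose name contains brackets, plus three decoys.
        for name in ["[abc].csv", "a.csv", "b.csv", "c.csv"]:
            open(os.path.join(d, name), "w").close()

        # The user escapes the brackets to address the literal file.
        user_pattern = glob.escape("[abc].csv")

        # First glob: correctly resolves to the literal file name.
        first = glob_names(d, user_pattern)

        # Second glob (the bug): the resolved name is treated as a pattern
        # again, so "[abc]" now acts as a character class.
        second = glob_names(d, first[0])
        return first, second


first_pass, second_pass = demo()
print(first_pass)   # ['[abc].csv']
print(second_pass)  # ['a.csv', 'b.csv', 'c.csv']
```

The second pass silently selects different files from the one the user asked for, which is why a path that has already been globbed should be passed on verbatim.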