[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-32810:
------------------------------------

    Assignee: Apache Spark

> CSV/JSON data sources should avoid globbing paths when inferring schema
> -----------------------------------------------------------------------
>
>                 Key: SPARK-32810
>                 URL: https://issues.apache.org/jira/browse/SPARK-32810
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Apache Spark
>            Priority: Major
>
> When the user doesn't specify a schema when reading a CSV table, the CSV
> file format and data source need to infer the schema. They do so by creating
> a base DataSource relation, and there is a mismatch:
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but
> *DataSource.paths* expects file paths as glob patterns. A path that has
> already been globbed is therefore globbed a second time.
> An example is demonstrated below (read from bottom to top):
> {code:java}
> ^
> |  DataSource.resolveRelation   tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> |  DataSource.apply             ^
> |  CSVDataSource.inferSchema    |
> |  CSVFileFormat.inferSchema    |
> |  ...                          |
> |  DataSource.resolveRelation   globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |  DataSource.apply             ^
> |  DataFrameReader.load         |
> |                               input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source, and in MLlib's LibSVM
> data source.
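
The double-globbing failure described in the issue can be reproduced in miniature outside Spark. The sketch below is only an analogy, not Spark's code: it uses Python's standard `glob` module, whose escape syntax (`[[]`) differs from the backslash escapes used in the issue's example, and `resolve` and `demo` are hypothetical helpers introduced for illustration. It assumes a directory holding a file literally named `[abc].csv` next to a file named `a.csv`.

```python
import glob
import os
import tempfile


def resolve(pattern, directory):
    """Expand a glob pattern to the sorted file names it matches in `directory`."""
    # Escape the directory part so only `pattern` is interpreted as a glob.
    matches = glob.glob(os.path.join(glob.escape(directory), pattern))
    return sorted(os.path.basename(m) for m in matches)


def demo():
    with tempfile.TemporaryDirectory() as d:
        # One file whose name contains glob metacharacters, one that does not.
        for name in ("[abc].csv", "a.csv"):
            with open(os.path.join(d, name), "w") as f:
                f.write("x\n1\n")

        # The user escapes the brackets to ask for the literal file name.
        user_input = glob.escape("[abc].csv")

        # First expansion (analogous to the outer resolveRelation): correct.
        first = resolve(user_input, d)

        # Second expansion (analogous to the inner resolveRelation during
        # schema inference): the already-resolved name is treated as a
        # pattern again, so the brackets become a character class.
        second = [name for p in first for name in resolve(p, d)]
        return user_input, first, second


if __name__ == "__main__":
    user_input, first, second = demo()
    print("escaped input:", user_input)
    print("globbed once: ", first)
    print("globbed twice:", second)
```

Globbing once yields the intended file `[abc].csv`; globbing that result again yields `a.csv` instead, which is the kind of wrong-file resolution the issue describes. Treating the first result as a verbatim path avoids the second expansion.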