[ https://issues.apache.org/jira/browse/SPARK-32206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qionghui Zhang updated SPARK-32206:
-----------------------------------
    Description:
I'm using Azure Data Lake Storage Gen2. When I load a data frame with the following options:

var df = spark.read.format("csv")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .option("parserLib", "UNIVOCITY")
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("mode", "PERMISSIVE")
  .option("quote", "\"")
  .option("escape", "\"")
  .option("timeStampFormat", "M/d/yyyy H:m:s a")
  .load("abfss://{containername}@{storage}.dfs.core.windows.net/{somedirectory}/{somefilesnapshotname}yyyy-MM-dd'T'hh:mm:ss")
  .limit(1)

it loads the data correctly.

But if I use {DirectoryWithColon}, it throws an error:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {somefilesnapshotname}yyyy-MM-dd'T'hh:mm:ss

If I then remove .option("multiline", "true"), the data can be loaded, but the data frame is certainly not parsed correctly, because the data contains newline characters.

So I believe this is a bug.

Our production jobs run correctly when we use spark.read.schema({SomeSchemaList}).format("csv"), but we want to use the inferSchema feature on file paths that contain colons or other special characters. Could you help fix this issue?

  was:
I'm using Azure Data Lake Storage Gen2. When I load a data frame with the following options:

var df = spark.read.format("csv")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .option("parserLib", "UNIVOCITY")
  .option("multiline", "true")
  .option("inferSchema", "true")
  .option("mode", "PERMISSIVE")
  .option("quote", "\"")
  .option("escape", "\"")
  .option("timeStampFormat", "M/d/yyyy H:m:s a")
  .load("abfss://{containername}@{storage}.dfs.core.windows.net/{DirectoryWithoutColon}")
  .limit(1)

it loads the data correctly.

But if I use {DirectoryWithColon}, it throws an error:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {somefilesnapshotname}yyyy-MM-dd'T'hh:mm:ss

If I then remove .option("multiline", "true"), the data can be loaded, but the data frame is certainly not parsed correctly, because the data contains newline characters.

So I believe this is a bug.

Our production jobs run correctly when we use spark.read.schema({SomeSchemaList}).format("csv"), but we want to use the inferSchema feature on file paths that contain colons or other special characters. Could you help fix this issue?


> Enable multi-line true could break the read csv in Azure Data Lake Storage gen2
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-32206
>                 URL: https://issues.apache.org/jira/browse/SPARK-32206
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.5
>            Reporter: Qionghui Zhang
>            Priority: Major
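
The wording of the reported exception, "Relative path in absolute URI", is the same text that java.net.URI produces when it is given a scheme together with a path that does not start with "/". The sketch below reproduces that message with the JDK alone. It is only one plausible reading of the failure, not a confirmed diagnosis of the Spark or Hadoop code path. The file name "snapshot2020-07-07T10:30:00" is a made-up stand-in for {somefilesnapshotname}yyyy-MM-dd'T'hh:mm:ss, and the splitting rule in parseAsUri (text before the first colon is treated as a scheme unless a "/" comes first) is an assumption made for illustration.

```scala
import java.net.{URI, URISyntaxException}

object ColonInFileNameSketch {
  // Hypothetical file name; the real name is elided in the report.
  val fileName = "snapshot2020-07-07T10:30:00"

  // Assumed parsing rule (illustrative only): if a ':' appears before any '/',
  // everything left of it is taken as the URI scheme and the rest as the path.
  def parseAsUri(name: String): Either[String, URI] = {
    val colon = name.indexOf(':')
    val slash = name.indexOf('/')
    val hasScheme = colon > 0 && (slash == -1 || colon < slash)
    val scheme = if (hasScheme) name.substring(0, colon) else null
    val path = if (hasScheme) name.substring(colon + 1) else name
    // The multi-argument URI constructor rejects a scheme combined with a
    // path that does not begin with '/'.
    try Right(new URI(scheme, null, path, null, null))
    catch { case e: URISyntaxException => Left(e.getMessage) }
  }

  def main(args: Array[String]): Unit = {
    // Bare name: the text before the first colon is read as a scheme.
    println(parseAsUri(fileName))
    // Same name behind "./": a '/' precedes the colon, so no scheme is inferred.
    println(parseAsUri("./" + fileName))
  }
}
```

Under this reading, the bare file name fails while the same name prefixed with "./" parses. The workaround already mentioned in the description, supplying the schema with spark.read.schema({SomeSchemaList}) instead of using inferSchema, is reported there to load these paths correctly.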