[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-24068:
---------------------------------
    Fix Version/s: 2.3.1

> CSV schema inferring doesn't work for compressed files
> ------------------------------------------------------
>
>                 Key: SPARK-24068
>                 URL: https://issues.apache.org/jira/browse/SPARK-24068
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
>
> Here is a simple CSV file compressed by lzo:
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv  test.csv.lzo
> {code}
> Reading test.csv.lzo with the LZO codec (see https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +-----+
> |�LZO|
> +-----+
> |    a|
> +-----+
> {code}
> But the file can be read if the schema is specified:
> {code:scala}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +----+----+
> |col1|col2|
> +----+----+
> |   a|   1|
> +----+----+
> {code}
> Just in case, schema inference works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}
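
A possible workaround on affected versions: the schema-inference pass reads the file through the text datasource, which in 2.3.0 does not appear to receive per-read options such as io.compression.codecs (while the actual parsing path does), so registering the codec globally in the Hadoop configuration should make it visible to both passes. This is a minimal, untested sketch, not the fix that shipped in 2.3.1/2.4.0; it assumes hadoop-lzo is on the classpath and reuses the test.csv.lzo file from above:

{code:scala}
// In spark-shell. Register the LZO codec globally instead of (or in
// addition to) the per-read option, so the schema-inference pass can
// also decompress the file.
// Note: io.compression.codecs replaces the default codec list, so the
// standard codecs are restated here alongside LzopCodec.
spark.sparkContext.hadoopConfiguration.set(
  "io.compression.codecs",
  "org.apache.hadoop.io.compress.DefaultCodec," +
  "org.apache.hadoop.io.compress.GzipCodec," +
  "com.hadoop.compression.lzo.LzopCodec")

val ds = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("test.csv.lzo")

ds.printSchema()
// If the codec is picked up, the schema should match the uncompressed case:
// root
//  |-- col1: string (nullable = true)
//  |-- col2: integer (nullable = true)
{code}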