[ https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shixiong Zhu updated SPARK-17153:
---------------------------------
    Component/s:     (was: DStreams)
                     Structured Streaming

> [Structured streams] readStream ignores partition columns
> ----------------------------------------------------------
>
>                 Key: SPARK-17153
>                 URL: https://issues.apache.org/jira/browse/SPARK-17153
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.0.0
>            Reporter: Dmitri Carpov
>            Assignee: Liang-Chi Hsieh
>              Labels: release_notes, releasenotes
>             Fix For: 2.0.2, 2.1.0
>
>
> When parquet files are persisted with partitions, Spark's `readStream`
> returns all `null`s for the partitioned columns.
> For example:
> {noformat}
> case class A(id: Int, value: Int)
> val data = spark.createDataset(Seq(
>   A(1, 1),
>   A(2, 2),
>   A(2, 3))
> )
> val url = "/mnt/databricks/test"
> data.write.partitionBy("id").parquet(url)
> {noformat}
> When the data is read back as a stream:
> {noformat}
> spark.readStream.schema(spark.read.load(url).schema).parquet(url)
> {noformat}
> it produces:
> {noformat}
> id, value
> null, 1
> null, 2
> null, 3
> {noformat}
> A possible reason is that `readStream` reads the parquet files directly, but when
> partitioned data is written, the partition columns are excluded from the files
> themselves. In the given example the parquet files contain only the `value`
> column, since `id` is a partition column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
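For comparison, a minimal sketch, assuming the same `/mnt/databricks/test` layout written in the example above: a plain batch read of the path reconstructs the partition column from the `id=<value>` directory names, which is the behavior the streaming read is expected to match.

{noformat}
// Batch read of the same partitioned directory. Spark's partition
// discovery parses the `id=<value>` subdirectories and re-attaches
// `id` as a column (partition columns are appended after data columns).
val batch = spark.read.parquet(url)
batch.show()
// +-----+---+
// |value| id|
// +-----+---+
// |    1|  1|
// |    2|  2|
// |    3|  2|
// +-----+---+
// (row order may vary)
{noformat}

Per the Fix For field (2.0.2 / 2.1.0), the `readStream` snippet in the issue description should likewise return the `id` values instead of nulls once the fix is applied.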