[ https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitri Carpov updated SPARK-17153:
----------------------------------
    Description:
When Parquet files are persisted using partitions, Spark's `readStream` returns data with all `null`s for the partition columns.

For example:
{noformat}
case class A(id: Int, value: Int)
val data = spark.createDataset(Seq(
  A(1, 1),
  A(2, 2),
  A(2, 3))
)
val url = "/mnt/databricks/test"
data.write.partitionBy("id").parquet(url)
{noformat}

when the data is read as a stream:
{noformat}
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
{noformat}

it reads:
{noformat}
id, value
null, 1
null, 2
null, 3
{noformat}

A possible reason is that `readStream` reads the Parquet files directly, but when they are written, the columns they are partitioned by are excluded from the files themselves. In the given example the Parquet files contain only `value`, since `id` is the partition column.

  was:
When Parquet files are persisted using partitions, Spark's `readStream` returns data with all `null`s for the partition columns.

For example:
```
case class A(id: Int, value: Int)
val data = spark.createDataset(Seq(
  A(1, 1),
  A(2, 2),
  A(2, 3))
)
val url = "/mnt/databricks/test"
data.write.partitionBy("id").parquet(url)
```

when the data is read as a stream:
```
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
```

it reads:
```
id, value
null, 1
null, 2
null, 3
```

A possible reason is that `readStream` reads the Parquet files directly, but when they are written, the columns they are partitioned by are excluded from the files themselves. In the given example the Parquet files contain only `value`, since `id` is the partition column.
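To illustrate the suspected cause: `partitionBy("id")` writes each row group under a directory named `id=<value>`, so the value of `id` exists only in the file path, not inside the Parquet file. The sketch below is plain Scala with no Spark dependency. `PartitionPathSketch`, `partitionValues`, and the file name `part-00000.parquet` are hypothetical names chosen for illustration; Spark's own partition discovery is more involved (type inference, escaping) and is not reproduced here.

```scala
// Hypothetical helper, not part of Spark: recovers partition values from
// directory segments of the form name=value in a file path.
object PartitionPathSketch {
  def partitionValues(path: String): Map[String, String] =
    path.split('/').toSeq
      .dropRight(1)                       // the last segment is the data file itself
      .map(_.split("=", 2))               // "id=2" -> Array("id", "2")
      .collect { case Array(name, value) => name -> value }
      .toMap

  def main(args: Array[String]): Unit = {
    // Illustrative file name; real part files carry longer generated names.
    val file = "/mnt/databricks/test/id=2/part-00000.parquet"
    println(partitionValues(file))        // prints Map(id -> 2)
  }
}
```

A reader that opens the part files directly, without looking at the directory names, has no source for `id` and can only fill the column with `null`, which matches the output reported above.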
> [Structured streams] readStream ignores partition columns
> ---------------------------------------------------------
>
>                 Key: SPARK-17153
>                 URL: https://issues.apache.org/jira/browse/SPARK-17153
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 2.0.0
>            Reporter: Dmitri Carpov