Dmitri Carpov created SPARK-17153:
-------------------------------------

             Summary: [Structured streams] readStream ignores partition columns
                 Key: SPARK-17153
                 URL: https://issues.apache.org/jira/browse/SPARK-17153
             Project: Spark
          Issue Type: Bug
          Components: Streaming
    Affects Versions: 2.0.0
            Reporter: Dmitri Carpov

When Parquet files are written with `partitionBy`, Spark's `readStream` returns `null` for every partition column.

For example:

```
case class A(id: Int, value: Int)

val data = spark.createDataset(Seq(A(1, 1), A(2, 2), A(2, 3)))
val url = "/mnt/databricks/test"

// Write the dataset partitioned by `id`
data.write.partitionBy("id").parquet(url)
```

When the data is read as a stream:

```
// Reuse the schema inferred by a batch read of the same location
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
```

it reads:

```
id, value
null, 1
null, 2
null, 3
```

A possible reason is that `readStream` reads the Parquet files directly, but the columns the data is partitioned by are excluded from the files themselves when they are written. In the given example, the Parquet files contain only `value`, since `id` is the partition column.
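
To illustrate the suspected cause: `partitionBy("id")` encodes each partition value in a directory name of the form `id=<value>` rather than inside the data files, so a reader that only opens the files has nothing to fill `id` with. The following is a minimal sketch in plain Scala, with no Spark dependency, of recovering those values from a file path. `PartitionPathSketch`, `partitionValues`, and the part-file name are hypothetical names chosen for illustration; this is not Spark's own partition discovery code.

```scala
// Hypothetical helper, not part of Spark: collects Hive-style `key=value`
// path segments, which is where partitionBy stores partition values.
object PartitionPathSketch {
  def partitionValues(path: String): Map[String, String] =
    path.split('/').toSeq
      .filter(_.contains("="))          // keep only `key=value` segments
      .map { segment =>
        // Split on the first '=' only, so values may themselves contain '='
        val Array(key, value) = segment.split("=", 2)
        key -> value
      }
      .toMap

  def main(args: Array[String]): Unit = {
    // Illustrative file name; the part-file names Spark actually writes differ.
    val file = "/mnt/databricks/test/id=2/part-00000.parquet"
    println(partitionValues(file))      // prints Map(id -> 2)
  }
}
```

The values come back as strings. A real reader would still need to cast them to the column type from the supplied schema (`Int` for `id` in the example above).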