[ 
https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-14343.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 13431
[https://github.com/apache/spark/pull/13431]

> Dataframe operations on a partitioned dataset (using partition discovery) 
> return invalid results
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14343
>                 URL: https://issues.apache.org/jira/browse/SPARK-14343
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>            Reporter: Jurriaan Pruis
>            Assignee: Cheng Lian
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> When reading a dataset using {{sqlContext.read.text()}} queries on the 
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the 
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to