[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Lian resolved SPARK-14343.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 13431
[https://github.com/apache/spark/pull/13431]

> Dataframe operations on a partitioned dataset (using partition discovery)
> return invalid results
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14343
>                 URL: https://issues.apache.org/jira/browse/SPARK-14343
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>            Reporter: Jurriaan Pruis
>            Assignee: Cheng Lian
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> When reading a dataset using {{sqlContext.read.text()}} queries on the
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
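For anyone re-checking the fix, the reproduction above can be sketched in plain Python: the snippet recreates the two directory layouts and derives, straight from the directory names, the partition value each row should carry. The helper names {{make_dataset}} and {{expected_rows}} are hypothetical and not part of Spark, and the "all digits means int" rule is only a rough stand-in for Spark's partition type inference, not its actual implementation.

```python
import os


def make_dataset(root, column, values):
    """Create root/<column>=<value>/part01.txt for each value (mirrors repro.sh)."""
    for value in values:
        part_dir = os.path.join(root, "{}={}".format(column, value))
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part01.txt"), "w") as f:
            f.write("data from {}\n".format(value))


def expected_rows(root):
    """Return (value, partition_value) pairs read directly from disk.

    Assumption: a partition value made only of digits is treated as an int,
    anything else stays a string. Spark's real inference is more involved.
    """
    rows = []
    for entry in sorted(os.listdir(root)):
        column, sep, raw = entry.partition("=")
        if not sep:
            continue  # skip anything that is not a key=value directory
        part_value = int(raw) if raw.isdigit() else raw
        part_dir = os.path.join(root, entry)
        for name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, name)) as f:
                for line in f:
                    rows.append((line.rstrip("\n"), part_value))
    return rows
```

Calling {{make_dataset('dataset', 'year', [2014, 2015])}} and then {{expected_rows('dataset')}} gives the pairs that {{df.show()}} displays in the report; the second element of each pair is what {{df.select('year')}} should return once the fix is in place.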