[ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244669#comment-15244669 ]

Jurriaan Pruis commented on SPARK-14343:
----------------------------------------

On the Spark 2.0.0 nightly build it doesn't work at all:

{code:none}
>>> df = sqlContext.read.text('dataset')
16/04/17 16:11:34 INFO HDFSFileCatalog: Listing file:/Users/.../dataset on driver
16/04/17 16:11:34 INFO HDFSFileCatalog: Listing file:/Users/.../dataset/year=2014 on driver
16/04/17 16:11:34 INFO HDFSFileCatalog: Listing file:/Users/.../dataset/year=2015 on driver
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 245, in text
    return self._df(self._jreader.text(self._sqlContext._sc._jvm.PythonUtils.toSeq(paths)))
  File "/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 836, in __call__
  File "/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/pyspark/sql/utils.py", line 57, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Try to map struct<value:string,year:int> to Tuple1, but failed as the number of fields does not line up.\n - Input schema: struct<value:string,year:int>\n - Target schema: struct<value:string>;'
{code}
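
The error boils down to a field-count mismatch: {{text()}} expects a single {{value}} column, but partition discovery adds {{year}}, so the input schema has two fields where the target has one. A minimal plain-Python sketch of that check (not Spark's actual implementation; the function name and shape are made up for illustration):

```python
# Illustrative sketch only: mimics the field-count check behind the
# AnalysisException above. Not Spark's real code; names are invented.
def check_field_count(input_fields, target_fields):
    """Raise if the input schema's field count differs from the target's."""
    if len(input_fields) != len(target_fields):
        raise ValueError(
            "Try to map struct<%s> to Tuple1, but failed as the number of "
            "fields does not line up." % ",".join(input_fields))
    return True

# Partition discovery adds 'year' to the single 'value' column text() expects:
try:
    check_field_count(["value:string", "year:int"], ["value:string"])
except ValueError as e:
    print(e)
```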

> Dataframe operations on a partitioned dataset (using partition discovery) 
> return invalid results
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14343
>                 URL: https://issues.apache.org/jira/browse/SPARK-14343
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4
>            Reporter: Jurriaan Pruis
>
> When reading a dataset using {{sqlContext.read.text()}}, queries on the 
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. It seems to return the string length of the value 
> column instead of the year?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the contents of the value column instead of the month partition column.
> h3. Workaround
> When I convert the DataFrame to an RDD and back to a DataFrame, I get the 
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
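
The reporter's length hypothesis for the first dataset fits the numbers: both rows of the value column are exactly 14 characters, matching the bogus {{14}} returned by {{df.select('year')}}. A quick check outside Spark:

```python
# Both text lines happen to be 14 characters long, which matches the
# incorrect '14' shown for df.select('year') in the first dataset above.
rows = ["data from 2014", "data from 2015"]
print([len(r) for r in rows])  # [14, 14]
```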



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
