[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415961#comment-15415961 ]

immerrr again commented on SPARK-16975:
---------------------------------------

Oh, that's unfortunate. Coming from the Python world, an underscore seems like a
natural prefix for "internal things".

What bugs me, though, is that Spark 2.0 had no problem reading up to 31
directories starting with underscores and bugged out only when there were 32 of
them.
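
A guess at why 32 is the magic number: it matches the default value of
spark.sql.sources.parallelPartitionDiscovery.threshold, above which Spark 2.0
switches to a different (parallel) file-listing code path. The config name is
real; the connection to this bug is only my hypothesis. If it's right, raising
the threshold should make the same read succeed:

{code}
# Untested hypothesis: the 31-vs-32 cliff lines up with the default (32) of
# spark.sql.sources.parallelPartitionDiscovery.threshold, where Spark switches
# to a parallel listing path that may filter underscore-paths differently.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")
spark.read.parquet(*subdirs[:32])  # should succeed if the hypothesis holds
{code}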

And I'll try the rename; give me a sec...
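
For the record, a minimal sketch of the rename I'm about to try, assuming
local-filesystem paths (on HDFS the same idea would go through `hadoop fs -mv`):
drop the leading underscore from the partition directories so they are no
longer treated as hidden.

{code}
import os

# Sketch only: rename _locality_code=XX partition directories to
# locality_code=XX so Spark stops treating them as hidden.
root = '/path/to/data'
for name in os.listdir(root):
    if name.startswith('_locality_code='):
        os.rename(os.path.join(root, name),
                  os.path.join(root, name[1:]))
{code}

Note that this also changes the inferred partition column name from
_locality_code to locality_code, so downstream queries would need updating.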

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16975
>                 URL: https://issues.apache.org/jira/browse/SPARK-16975
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>         Environment: Ubuntu Linux 14.04
>            Reporter: immerrr again
>              Labels: parquet
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by the _locality_code column. None of the 
> partitions are empty. I have narrowed the failure down to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> That got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it with 2.0, and try to read it back in 
> the same manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
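> Side note: the error says the schema "must be specified manually". A sketch of 
> what that would look like, taking the schema from a single partition that does 
> read fine; I haven't checked whether it actually sidesteps whatever is 
> filtering the paths:
> {code}
> # Sketch: supply the schema explicitly instead of relying on inference.
> schema = spark.read.parquet(subdirs[0]).schema
> df = spark.read.schema(schema).parquet(*subdirs[:32])
> {code}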
> I originally posted this to the user mailing list, but with these latest 
> discoveries it clearly seems like a bug.


