[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415903#comment-15415903 ]
Dongjoon Hyun commented on SPARK-16975:
---------------------------------------

Hi, [~immerrr]. I cannot reproduce your situation, but could you change `_locality_code` into `locality_code`?

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16975
>                 URL: https://issues.apache.org/jira/browse/SPARK-16975
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>         Environment: Ubuntu Linux 14.04
>            Reporter: immerrr again
>              Labels: parquet
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated by 1.6.2.
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by the _locality_code column. None of the partitions are empty. I have narrowed the failure down to the first 32 partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be specified manually;'
> {code}
> Interestingly, it works OK if you remove any one of those partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas of the first and the last 31 partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> That got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be specified manually;'
> {code}
> If I read the first partition, save it with Spark 2.0, and read that back in the same manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but given these latest discoveries it clearly looks like a bug.
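
A note on the workaround hinted at by the error message itself: the AnalysisException says the schema "must be specified manually". A minimal sketch of doing that in PySpark follows; the data field names and types are illustrative assumptions, since the report never shows the dataset's actual schema.

{code}
# Hedged sketch: bypass schema inference by supplying the schema explicitly.
# 'id' and 'value' are hypothetical data columns; only '_locality_code'
# (the partition column) appears in the original report.
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id", LongType()),
    StructField("value", StringType()),
    StructField("_locality_code", StringType()),  # partition column
])

df = spark.read.schema(schema).parquet('/path/to/data')
{code}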
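
As for the rename suggested in the comment above: Spark's file listing treats paths that start with an underscore or a dot as hidden, which may be why directories like `_locality_code=AQ` defeat schema inference. Since single-partition reads succeed per `In [100]`, one way to apply the rename is to rewrite each partition under the new column name. A sketch, assuming the `subdirs` list from the report and a hypothetical output path:

{code}
# Hedged sketch of the suggested rename: read each leaf directory (which the
# report shows works), re-attach the partition value parsed from the
# directory name, and write it back partitioned by 'locality_code'.
# '/path/to/data_renamed' is a hypothetical output location.
from pyspark.sql.functions import lit

for d in subdirs:
    value = d.rstrip('/').rsplit('=', 1)[-1]  # '.../_locality_code=AQ' -> 'AQ'
    (spark.read.parquet(d)
          .withColumn('locality_code', lit(value))
          .write.mode('append')
          .partitionBy('locality_code')
          .parquet('/path/to/data_renamed'))
{code}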