Another follow-up: I have narrowed the problem down to the first 32
partitions, but from there on things get strange.
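
(For context: subdirs below is the sorted list of partition directories
under the dataset root. The narrowing went roughly along these lines; this
is a sketch rather than the exact code, and it assumes the path is visible
to a plain os.listdir.)

import os

root = "/path/to/data"
subdirs = sorted(os.path.join(root, d)
                 for d in os.listdir(root)
                 if d.startswith("_locality_code="))

for n in range(1, len(subdirs) + 1):
    try:
        spark.read.parquet(*subdirs[:n])
    except Exception as exc:
        # the first failure shows up once the prefix reaches 32 subdirs
        print("prefix of %d subdirs fails: %s" % (n, exc))
        break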

Here's the error:

In [68]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It
must be specified manually;'


Removing *any* single subdir from that set makes the error go away:

In [69]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))

(no exception is raised for any i)


Here's the punchline: the schemas of the first 31 and of the last 31 of
those 32 subdirs are identical:

In [70]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() ==
spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True
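
In case it's useful, a per-subdir check along these lines (a sketch) should
show whether any single partition's schema diverges from the rest when read
on its own:

base = spark.read.parquet(subdirs[0]).schema.jsonValue()
for path in subdirs[1:32]:
    # compare each subdir's individually inferred schema to the first one
    if spark.read.parquet(path).schema.jsonValue() != base:
        print("schema mismatch: %s" % path)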

Any idea why that might be happening?
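
For completeness, the "specify it manually" route from the exception message
would presumably look like the sketch below (untested; it just borrows the
schema from the first subdir), but I'd still like to understand why
inference breaks only for this particular combination:

# untested sketch: reuse the schema inferred from a single subdir
schema = spark.read.parquet(subdirs[0]).schema
df = spark.read.schema(schema).parquet(*subdirs[:32])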

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again <imme...@gmail.com> wrote:
> Some follow-up information:
>
> - dataset size is ~150G
>
> - the data is partitioned by one of the columns, _locality_code:
> $ ls -1
> _locality_code=AD
> _locality_code=AE
> _locality_code=AF
> _locality_code=AG
> _locality_code=AI
> _locality_code=AL
> _locality_code=AM
> _locality_code=AN
> ....
> _locality_code=YE
> _locality_code=YT
> _locality_code=YU
> _locality_code=ZA
> _locality_code=ZM
> _locality_code=ZW
> _SUCCESS
>
> - some of the partitions contain only one row, but all partitions are
> in place (i.e. the number of directories matches the number of
> distinct localities):
> val counts = 
> sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()
>
> scala> counts.slice(counts.length-10, counts.length)
> res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255],
> [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467],
> [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])
>
> scala> counts.slice(0, 10)
> res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1],
> [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])
>
>
> On Tue, Aug 9, 2016 at 11:10 AM, immerrr again <imme...@gmail.com> wrote:
>> Hi everyone
>>
> >> I tried upgrading from Spark 1.6.2 to Spark 2.0.0 but ran into an issue
>> reading the existing data. Here's how the traceback looks in
>> spark-shell:
>>
>> scala> spark.read.parquet("/path/to/data")
>> org.apache.spark.sql.AnalysisException: Unable to infer schema for
>> ParquetFormat at /path/to/data. It must be specified manually;
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>>   ... 48 elided
>>
>> If I enable DEBUG log with sc.setLogLevel("DEBUG"), here's what I
>> additionally see in the output:
> >> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59.
> >> Of course, that same data is read and processed correctly by Spark 1.6.2.
>>
>> Any idea what might be wrong here?
>>
>> Cheers,
>> immerrr

