Another follow-up: I have narrowed it down to the first 32 partitions, but beyond that point things get strange.
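For context, the narrowing was essentially a linear scan over growing prefixes of the partition directories. A minimal sketch of that step (not the exact commands I ran; it assumes `subdirs` holds the sorted list of partition paths and `spark` is the SparkSession):

from pyspark.sql.utils import AnalysisException

# Grow the prefix of partition paths until the read starts failing.
# Schema inference runs eagerly inside spark.read.parquet(), so a bad
# combination raises immediately, no action needed.
for n in range(1, len(subdirs) + 1):
    try:
        spark.read.parquet(*subdirs[:n])
    except AnalysisException as exc:
        print(n, exc)  # on this dataset the first failure is at n == 32
        break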
Here's the error:

In [68]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be specified manually;'

Removing *any* one of the subdirs from that set makes the error go away:

In [69]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))

Here's the punchline: the schemas inferred for the first 31 and for the last 31 of those 32 subdirs are identical:

In [70]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() == spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True

Any idea why that might be happening?

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again <imme...@gmail.com> wrote:
> Some follow-up information:
>
> - dataset size is ~150G
>
> - the data is partitioned by one of the columns, _locality_code:
>
> $ ls -1
> _locality_code=AD
> _locality_code=AE
> _locality_code=AF
> _locality_code=AG
> _locality_code=AI
> _locality_code=AL
> _locality_code=AM
> _locality_code=AN
> ....
> _locality_code=YE
> _locality_code=YT
> _locality_code=YU
> _locality_code=ZA
> _locality_code=ZM
> _locality_code=ZW
> _SUCCESS
>
> - some of the partitions contain only one row, but all partitions are
> in place (i.e. the number of directories matches the number of
> distinct localities):
>
> val counts = sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()
>
> scala> counts.slice(counts.length-10, counts.length)
> res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255], [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467], [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])
>
> scala> counts.slice(0, 10)
> res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1], [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])
>
> On Tue, Aug 9, 2016 at 11:10 AM, immerrr again <imme...@gmail.com> wrote:
>> Hi everyone,
>>
>> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
>> reading the existing data. Here's how the traceback looks in
>> spark-shell:
>>
>> scala> spark.read.parquet("/path/to/data")
>> org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;
>>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>>   ... 48 elided
>>
>> If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I
>> additionally see in the output:
>> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59
>> Of course, that same data is read and processed correctly by Spark-1.6.2.
>>
>> Any idea what might be wrong here?
>>
>> Cheers,
>> immerrr
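P.S. Two possible workarounds, in case anyone hits the same wall. Both are untested sketches that assume the `spark` session and `subdirs` list from above; the config guess in the second one is just that, a guess:

# 1. Do what the error message says and specify the schema manually:
#    lift it from a subset that infers fine (any 31 of the 32 paths do)
#    and pass it explicitly, so the full read skips inference entirely.
schema = spark.read.parquet(*subdirs[:31]).schema
df = spark.read.schema(schema).parquet(*subdirs)

# 2. 32 happens to be the default of Spark's parallel-listing cutoff,
#    spark.sql.sources.parallelPartitionDiscovery.threshold. If the two
#    are related, raising it before reading might also sidestep the error.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")
df = spark.read.parquet(*subdirs)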