Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
OK, I've merged this PR to master and branch-2.0.

On 8/11/16 8:27 AM, Cheng Lian wrote:
> Haven't figured out exactly how it failed, but the leading underscore in
> the partition directory name looks suspicious. Could you please try this
> PR to see whether it fixes the issue:
> https://github.com/apache/spark/pull/14585/files
>
> Cheng
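For later readers of the archive, the essence of the fix is easiest to see as a tweak to the hidden-path filter that data sources apply while listing files: underscore-prefixed names that contain "=" are partition directories and should survive the listing. The Python sketch below conveys the gist only; it is not the literal Scala change from the PR, and the function names are illustrative, not Spark's API.

# Hedged sketch of the behavior change, in Python for brevity; this is not
# the literal diff from PR 14585 and the names are illustrative.
def should_filter_out_before(name):
    # Pre-fix: every "_"- or "."-prefixed name is treated as hidden.
    return name.startswith("_") or name.startswith(".")

def should_filter_out_after(name):
    # Post-fix: an underscore name containing "=" looks like a partition
    # directory and must be kept.
    return (name.startswith("_") and "=" not in name) or name.startswith(".")

name = "_locality_code=AQ"
print(should_filter_out_before(name), should_filter_out_after(name))  # True False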
Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the issue: https://github.com/apache/spark/pull/14585/files

Cheng

On 8/9/16 5:38 PM, immerrr again wrote:
> Another follow-up: I have narrowed it down to the first 32 partitions, but
> from that point it gets strange. [...]
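Why an underscore would matter: file-based data sources conventionally treat names starting with "_" or "." (e.g. _SUCCESS, _metadata) as hidden bookkeeping entries and skip them while listing input files. A toy illustration of that convention, with an invented looks_hidden helper rather than any real Spark function, shows how every partition directory in this dataset would disappear from the listing:

# Toy illustration of the hidden-path convention; looks_hidden is an
# invented name, not Spark's actual API.
def looks_hidden(name):
    return name.startswith("_") or name.startswith(".")

dirs = ["_locality_code=AQ", "_locality_code=AI", "_SUCCESS"]
print([d for d in dirs if not looks_hidden(d)])  # [] -- nothing left to scan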
Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
Another follow-up: I have narrowed it down to the first 32 partitions, but from that point it gets strange. Here's the error:

In [68]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be specified manually;'

Removing *any* of the subdirs from that set removes the error:

In [69]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))

Here's the punchline: the schemas for the first 31 and for the last 31 of those 32 subdirs are the same:

In [70]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() == spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True

Any idea why that might be happening?

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again wrote:
> Some follow-up information: [...]
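A leave-one-out loop like the In [69] line can be packaged into a small helper. The sketch below is hypothetical; it assumes an active pyspark session named spark and the subdirs list from above, and simply records which iterations still raise:

# Hypothetical helper automating the In [69] leave-one-out experiment;
# assumes an active pyspark session `spark` and the `subdirs` list above.
from pyspark.sql.utils import AnalysisException

def leave_one_out_failures(paths):
    # Return the indices whose removal still leaves a failing subset.
    failing = []
    for i in range(len(paths)):
        subset = paths[:i] + paths[i + 1:]
        try:
            spark.read.parquet(*subset)
        except AnalysisException:
            failing.append(i)
    return failing

# For the 32 directories above this returns [], i.e. dropping any single
# one of them makes the read succeed.
print(leave_one_out_failures(subdirs[:32]))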
Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
Some follow-up information:

- dataset size is ~150G

- the data is partitioned by one of the columns, _locality_code:

$ ls -1
_locality_code=AD
_locality_code=AE
_locality_code=AF
_locality_code=AG
_locality_code=AI
_locality_code=AL
_locality_code=AM
_locality_code=AN

_locality_code=YE
_locality_code=YT
_locality_code=YU
_locality_code=ZA
_locality_code=ZM
_locality_code=ZW
_SUCCESS

- some of the partitions contain only one row, but all partitions are in place (i.e. the number of directories matches the number of distinct localities):

val counts = sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()

scala> counts.slice(counts.length-10, counts.length)
res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255], [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467], [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])

scala> counts.slice(0, 10)
res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1], [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])

On Tue, Aug 9, 2016 at 11:10 AM, immerrr again wrote:
> Hi everyone [...]
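As background on the layout above: Hive-style partitioning encodes the partition column into each directory name as column=value, and readers recover the column by parsing those names during discovery. A toy parser (illustrative only, not Spark's internals) makes the mapping concrete:

# Toy sketch of Hive-style partition-directory parsing; this is
# illustrative and not Spark's actual discovery code.
def parse_partition_dir(name):
    column, _, value = name.partition("=")
    return (column, value) if value else None

print(parse_partition_dir("_locality_code=AD"))  # ('_locality_code', 'AD')
print(parse_partition_dir("_SUCCESS"))           # None -- not a partition dir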
Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
Hi everyone,

I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue reading the existing data. Here's how the traceback looks in spark-shell:

scala> spark.read.parquet("/path/to/data")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
  ... 48 elided

If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I additionally see in the output: https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59. Of course, that same data is read and processed by spark-1.6.2 correctly.

Any idea what might be wrong here?

Cheers,
immerrr
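For anyone trying to reproduce the symptom without a 150G dataset, a minimal sketch follows. It assumes an affected Spark 2.0.0 pyspark shell, and the output path and column name are made up for illustration:

# Hypothetical minimal reproduction in an affected pyspark 2.0.0 shell;
# the output path and column name are illustrative.
df = spark.createDataFrame([(1, "AD"), (2, "AE")], ["value", "_locality_code"])
df.write.partitionBy("_locality_code").parquet("/tmp/underscore_demo")
# On affected builds the underscore-prefixed partition directories are
# filtered out as hidden, so schema inference finds no files:
spark.read.parquet("/tmp/underscore_demo")  # AnalysisException: Unable to infer schema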