Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
Another follow-up: I have narrowed it down to the first 32 partitions, but from that point on it gets strange. Here's the error:

In [68]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be specified manually;'

Removing *any* of the subdirs from that set makes the error go away:

In [69]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))

Here's the punchline: the schemas for the first 31 and for the last 31 of those 32 subdirs are the same:

In [70]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() == spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True

Any idea why that might be happening?

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again wrote:
> Some follow-up information:
>
> - dataset size is ~150G
>
> - the data is partitioned by one of the columns, _locality_code:
>
>   $ ls -1
>   _locality_code=AD
>   _locality_code=AE
>   _locality_code=AF
>   _locality_code=AG
>   _locality_code=AI
>   _locality_code=AL
>   _locality_code=AM
>   _locality_code=AN
>
>   _locality_code=YE
>   _locality_code=YT
>   _locality_code=YU
>   _locality_code=ZA
>   _locality_code=ZM
>   _locality_code=ZW
>   _SUCCESS
>
> - some of the partitions contain only one row, but all partitions are
>   in place (i.e. the number of directories matches the number of
>   distinct localities):
>
>   val counts = sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()
>
>   scala> counts.slice(counts.length-10, counts.length)
>   res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255], [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467], [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])
>
>   scala> counts.slice(0, 10)
>   res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1], [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])
>
> On Tue, Aug 9, 2016 at 11:10 AM, immerrr again wrote:
>> Hi everyone,
>>
>> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
>> reading the existing data. Here's how the traceback looks in
>> spark-shell:
>>
>> scala> spark.read.parquet("/path/to/data")
>> org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;
>>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>>   ... 48 elided
>>
>> If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I
>> additionally see in the output:
>> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59.
>> Of course, that same data is read and processed by Spark-1.6.2 correctly.
>>
>> Any idea what might be wrong here?
>>
>> Cheers,
>> immerrr

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
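The In [69] loop above drops one directory at a time; the same idea can be pushed further to shrink a failing set of paths automatically. A minimal, pure-Python sketch of that greedy minimization (the `fails` predicate here is a hypothetical stand-in for wrapping `spark.read.parquet(*paths)` in a try/except, not a real Spark call):

```python
def find_minimal_failing_subset(subdirs, fails):
    """Greedily drop directories that are not needed to reproduce `fails`.

    `fails(paths)` should return True when reading that set of paths
    raises the schema-inference error.
    """
    current = list(subdirs)
    changed = True
    while changed:
        changed = False
        for d in list(current):
            trial = [x for x in current if x != d]
            if trial and fails(trial):  # d is not needed to reproduce the error
                current = trial
                changed = True
    return current

# Hypothetical reproducer: pretend the error only occurs when both the
# AQ and AI partitions are read together, mimicking the observation above.
def fails(paths):
    return "_locality_code=AQ" in paths and "_locality_code=AI" in paths

dirs = ["_locality_code=%s" % c for c in ("AD", "AE", "AI", "AL", "AQ")]
print(find_minimal_failing_subset(dirs, fails))
# -> ['_locality_code=AI', '_locality_code=AQ']
```

With a real try/except predicate around the Spark read, this narrows 32 candidate directories down to a minimal failing combination in far fewer runs than inspecting each subset by hand.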
Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
Some follow-up information:

- dataset size is ~150G

- the data is partitioned by one of the columns, _locality_code:

  $ ls -1
  _locality_code=AD
  _locality_code=AE
  _locality_code=AF
  _locality_code=AG
  _locality_code=AI
  _locality_code=AL
  _locality_code=AM
  _locality_code=AN

  _locality_code=YE
  _locality_code=YT
  _locality_code=YU
  _locality_code=ZA
  _locality_code=ZM
  _locality_code=ZW
  _SUCCESS

- some of the partitions contain only one row, but all partitions are
  in place (i.e. the number of directories matches the number of
  distinct localities):

  val counts = sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()

  scala> counts.slice(counts.length-10, counts.length)
  res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255], [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467], [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])

  scala> counts.slice(0, 10)
  res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1], [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])

On Tue, Aug 9, 2016 at 11:10 AM, immerrr again wrote:
> Hi everyone,
>
> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
> reading the existing data. Here's how the traceback looks in
> spark-shell:
>
> scala> spark.read.parquet("/path/to/data")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>   ... 48 elided
>
> If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I
> additionally see in the output:
> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59.
> Of course, that same data is read and processed by Spark-1.6.2 correctly.
>
> Any idea what might be wrong here?
>
> Cheers,
> immerrr
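For context on the directory listing above: partition discovery interprets Hive-style `key=value` directory names as partition columns, and is expected to skip bookkeeping entries whose names start with `_` or `.` (such as `_SUCCESS`). A rough, pure-Python illustration of that convention (this is a sketch of the naming scheme, not Spark's actual implementation):

```python
def parse_partition(name):
    """Parse a Hive-style `key=value` directory name.

    Returns (key, value), or None for non-partition entries such as
    files starting with '_' or '.' (e.g. _SUCCESS), which partition
    discovery is expected to ignore.
    """
    if "=" not in name:
        return None  # plain files; includes _SUCCESS, .crc files, etc.
    key, _, value = name.partition("=")
    return key, value

listing = ["_locality_code=AD", "_locality_code=ZW", "_SUCCESS"]
partitions = [p for p in map(parse_partition, listing) if p is not None]
print(partitions)
# -> [('_locality_code', 'AD'), ('_locality_code', 'ZW')]
```

One detail that may be relevant here: the partition column itself starts with an underscore (`_locality_code`), so a path filter that skips `_`-prefixed names too aggressively could plausibly drop the partition directories along with `_SUCCESS`. That is speculation from the naming convention, not a confirmed diagnosis.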
Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
Hi everyone,

I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue reading the existing data. Here's how the traceback looks in spark-shell:

scala> spark.read.parquet("/path/to/data")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
  ... 48 elided

If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I additionally see in the output:
https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59.
Of course, that same data is read and processed by Spark-1.6.2 correctly.

Any idea what might be wrong here?

Cheers,
immerrr
pyspark: dataframe.take is slow
Hi all!

I'm having a strange issue with pyspark 1.6.1. I have a dataframe

    df = sqlContext.read.parquet('/path/to/data')

whose "df.take(10)" is really slow, apparently scanning the whole dataset to take the first ten rows. "df.first()" works fast, as does "df.rdd.take(10)".

I have found https://issues.apache.org/jira/browse/SPARK-10731 that should have fixed this in 1.6.0, but apparently it has not.

What am I doing wrong here, and how can I fix this?

Cheers,
immerrr
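For what it's worth, the change tracked in SPARK-10731 is essentially about making `take(n)` scan partitions incrementally, growing the number of partitions tried each round, instead of computing the whole dataset. A toy, pure-Python sketch of that strategy (lists stand in for RDD partitions; `incremental_take` and `scale_up` are illustrative names, not Spark API):

```python
def incremental_take(partitions, n, scale_up=4):
    """Collect the first `n` rows while scanning as few partitions as possible.

    Start with one partition; if that doesn't yield n rows, try
    exponentially more partitions in the next round, rather than
    scanning the whole dataset up front.
    """
    rows, scanned, batch = [], 0, 1
    while len(rows) < n and scanned < len(partitions):
        for part in partitions[scanned:scanned + batch]:
            rows.extend(part)
        scanned += batch
        batch *= scale_up  # try more partitions next round

    return rows[:n]

parts = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(incremental_take(parts, 4))
# -> [1, 2, 3, 4]
```

If `df.take(10)` still scans everything on 1.6.1, comparing it against `df.rdd.take(10)` (as noted above) and `df.limit(10).collect()` may help narrow things down; whether the DataFrame path actually hits the incremental code presumably depends on the physical plan chosen.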