Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-12 Thread Cheng Lian

OK, I've merged this PR to master and branch-2.0.


On 8/11/16 8:27 AM, Cheng Lian wrote:
Haven't figured out exactly how it failed yet, but the leading
underscore in the partition directory name looks suspicious. Could you
please try this PR to see whether it fixes the issue:
https://github.com/apache/spark/pull/14585/files


Cheng
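
A note on why that underscore is suspicious: Hadoop and Spark conventionally
treat files and directories whose names start with "_" or "." as hidden
metadata (e.g. _SUCCESS, _metadata) and skip them when listing input paths.
Below is a minimal sketch of that convention applied to the directory names
from this thread; the filter is an illustration of the convention, not the
exact Spark 2.0 listing code:

// Conventional "hidden path" filter: names starting with "_" or "." are
// treated as metadata and skipped. Illustrative only.
def looksHidden(name: String): Boolean =
  name.startsWith("_") || name.startsWith(".")

Seq("_locality_code=AQ", "locality_code=AQ", "_SUCCESS").foreach { d =>
  println(s"$d -> ${if (looksHidden(d)) "skipped as hidden" else "listed"}")
}

A legitimate partition directory such as _locality_code=AQ matches the same
pattern as the _SUCCESS marker, which is why a filter along these lines would
silently drop it.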


On 8/9/16 5:38 PM, immerrr again wrote:

Another follow-up: I have narrowed it down to the first 32 partitions,
but from that point it gets strange.

Here's the error:

In [68]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It
must be specified manually;'


Removing *any* of the subdirs from that set removes the error.

In [69]: for i in range(32): spark.read.parquet(*(subdirs[:i] +
subdirs[i+1:32]))


Here's the punchline: schemas for the first 31 and for the last 31 of
those 32 subdirs are the same:

In [70]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() ==
spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True

Any idea why that might be happening?
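
One possibly related detail, offered as a guess rather than something
established in this thread: 32 happens to be the default value of
spark.sql.sources.parallelPartitionDiscovery.threshold, the number of input
paths above which Spark 2.0 switches from driver-side to distributed file
listing, so the failure may be tied to that parallel listing code path. A
quick probe of that hypothesis from the same session (treating subdirs as a
Seq[String] in spark-shell):

// Raise the parallel-listing threshold above the number of paths and retry.
// If the read then succeeds, the failure is tied to the distributed listing
// code path. (subdirs is assumed to hold the partition directory paths.)
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")
spark.read.parquet(subdirs.take(32): _*)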

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again wrote:

Some follow-up information:

- dataset size is ~150G

- the data is partitioned by one of the columns, _locality_code:
$ ls -1
_locality_code=AD
_locality_code=AE
_locality_code=AF
_locality_code=AG
_locality_code=AI
_locality_code=AL
_locality_code=AM
_locality_code=AN
...
_locality_code=YE
_locality_code=YT
_locality_code=YU
_locality_code=ZA
_locality_code=ZM
_locality_code=ZW
_SUCCESS

- some of the partitions contain only one row, but all partitions are
in place (i.e. the number of directories matches the number of distinct
localities):
val counts = 
sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()


scala> counts.slice(counts.length-10, counts.length)
res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255],
[AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467],
[UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])

scala> counts.slice(0, 10)
res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1],
[WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])
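
The directory layout listed above is what Spark's partitioned Parquet writer
produces, so the data was presumably written along these lines under 1.6 (a
guess at the original write call; df stands for the source DataFrame):

// Presumed write pattern (assumption): partitioning by the
// underscore-prefixed column yields the _locality_code=XX directories
// plus the _SUCCESS marker shown above.
df.write.partitionBy("_locality_code").parquet("/path/to/data")

If the leading underscore turns out to be the culprit, renaming the partition
column would mean rewriting the whole ~150G dataset, which is why a fix on
the Spark side (the PR above) is the preferable route.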


On Tue, Aug 9, 2016 at 11:10 AM, immerrr again wrote:

Hi everyone

I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
reading the existing data. Here's how the traceback looks in
spark-shell:

scala> spark.read.parquet("/path/to/data")
org.apache.spark.sql.AnalysisException: Unable to infer schema for
ParquetFormat at /path/to/data. It must be specified manually;
   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
   at scala.Option.getOrElse(Option.scala:121)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
   ... 48 elided

If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I
additionally see in the output:
https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59. Of
course, that same data is read and processed by Spark 1.6.2 correctly.

Any idea what might be wrong here?

Cheers,
immerrr
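
For anyone trying to confirm the diagnosis locally, here is a self-contained
repro attempt for a 2.0.0 spark-shell (the partition column name mirrors this
thread; the path and values are made up, and it deliberately writes more than
32 partitions because of the 32-path observation above). Whether it trips the
same "Unable to infer schema" error is exactly what running it would
establish:

// Write a small dataset partitioned by an underscore-prefixed column,
// with more than 32 distinct partition values, then read it back.
val df = (1 to 40).map(i => (i, f"P$i%02d")).toDF("id", "_locality_code")
df.write.mode("overwrite").partitionBy("_locality_code").parquet("/tmp/underscore_partition_repro")
spark.read.parquet("/tmp/underscore_partition_repro").show()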

