Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Another follow-up: I have narrowed the problem down to the first 32
partitions, but from that point on it gets strange.
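
For context, a minimal sketch of one way to reproduce that narrowing,
assuming `subdirs` holds the sorted partition directories (not
necessarily the exact commands I ran):

from pyspark.sql.utils import AnalysisException

for n in range(1, len(subdirs) + 1):
    try:
        # read an ever-longer prefix of the partition directories
        spark.read.parquet(*subdirs[:n])
    except AnalysisException:
        print("first failing prefix length: %d" % n)
        break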

Here's the error:

In [68]: spark.read.parquet(*subdirs[:32])
...
AnalysisException: u'Unable to infer schema for ParquetFormat at
/path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It
must be specified manually;'


Removing *any one* of the subdirs from that set makes the error go
away; all 32 of the following reads complete without raising:

In [69]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))


Here's the punchline: the schemas inferred for the first 31 and for the
last 31 of those 32 subdirs are identical:

In [70]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() ==
spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True

Any idea why that might be happening?
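
In case it helps with debugging, a small sketch for spotting
per-directory schema differences (field order, nullability, etc.) that
a combined read might trip over; `subdirs` is the same list as above:

base = spark.read.parquet(subdirs[0]).schema.jsonValue()
for d in subdirs[1:32]:
    try:
        s = spark.read.parquet(d).schema.jsonValue()
        if s != base:
            print("schema differs in %s" % d)
    except Exception as e:
        print("cannot infer schema for %s: %s" % (d, e))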

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again <imme...@gmail.com> wrote:
> Some follow-up information:
>
> - dataset size is ~150G
>
> - the data is partitioned by one of the columns, _locality_code:
> $ ls -1
> _locality_code=AD
> _locality_code=AE
> _locality_code=AF
> _locality_code=AG
> _locality_code=AI
> _locality_code=AL
> _locality_code=AM
> _locality_code=AN
> 
> _locality_code=YE
> _locality_code=YT
> _locality_code=YU
> _locality_code=ZA
> _locality_code=ZM
> _locality_code=ZW
> _SUCCESS
>
> - some of the partitions contain only one row, but all partitions are
> in place (i.e. the number of directories matches the number of
> distinct localities):
> val counts = 
> sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()
>
> scala> counts.slice(counts.length-10, counts.length)
> res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255],
> [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467],
> [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])
>
> scala> counts.slice(0, 10)
> res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1],
> [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])
>
>
> On Tue, Aug 9, 2016 at 11:10 AM, immerrr again <imme...@gmail.com> wrote:
>> Hi everyone
>>
>> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
>> reading the existing data. Here's how the traceback looks in
>> spark-shell:
>>
>> scala> spark.read.parquet("/path/to/data")
>> org.apache.spark.sql.AnalysisException: Unable to infer schema for
>> ParquetFormat at /path/to/data. It must be specified manually;
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>>   ... 48 elided
>>
>> If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I
>> additionally see in the output:
>> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59.
>> Of course, the same data is read and processed correctly by Spark-1.6.2.
>>
>> Any idea what might be wrong here?
>>
>> Cheers,
>> immerrr




Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Some follow-up information:

- dataset size is ~150G

- the data is partitioned by one of the columns, _locality_code:
$ ls -1
_locality_code=AD
_locality_code=AE
_locality_code=AF
_locality_code=AG
_locality_code=AI
_locality_code=AL
_locality_code=AM
_locality_code=AN

_locality_code=YE
_locality_code=YT
_locality_code=YU
_locality_code=ZA
_locality_code=ZM
_locality_code=ZW
_SUCCESS

- some of the partitions contain only one row, but all partitions are
in place (i.e. the number of directories matches the number of
distinct localities):
val counts = 
sqlContext.read.parquet("/path-to-data").groupBy("_locality_code").count().orderBy($"count").collect()

scala> counts.slice(counts.length-10, counts.length)
res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255],
[AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467],
[UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])

scala> counts.slice(0, 10)
res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1],
[WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])


On Tue, Aug 9, 2016 at 11:10 AM, immerrr again <imme...@gmail.com> wrote:
> Hi everyone
>
> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
> reading the existing data. Here's how the traceback looks in
> spark-shell:
>
> scala> spark.read.parquet("/path/to/data")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for
> ParquetFormat at /path/to/data. It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>   ... 48 elided
>
> If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I
> additionally see in the output:
> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59.
> Of course, the same data is read and processed correctly by Spark-1.6.2.
>
> Any idea what might be wrong here?
>
> Cheers,
> immerrr




Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Hi everyone

I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
reading the existing data. Here's how the traceback looks in
spark-shell:

scala> spark.read.parquet("/path/to/data")
org.apache.spark.sql.AnalysisException: Unable to infer schema for
ParquetFormat at /path/to/data. It must be specified manually;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
  ... 48 elided

If I enable DEBUG logging with sc.setLogLevel("DEBUG"), here's what I
additionally see in the output:
https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59.
Of course, the same data is read and processed correctly by Spark-1.6.2.

Any idea what might be wrong here?
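
For what it's worth, the workaround I'm considering, sketched below in
pyspark, is to skip schema inference and supply the schema explicitly,
taking it from one partition directory that reads fine on its own (the
_locality_code=AD path is just the first partition from the listing
above, and the partition column has to be appended by hand because a
single directory's footers don't contain it). I haven't verified yet
that this actually avoids the error:

from pyspark.sql.types import StructType, StructField, StringType

# schema of a single partition that reads without problems
part_schema = spark.read.parquet("/path/to/data/_locality_code=AD").schema

# append the partition column, which no single directory carries
full_schema = StructType(part_schema.fields +
                         [StructField("_locality_code", StringType())])

# with an explicit schema, Spark should not need to infer one
df = spark.read.schema(full_schema).parquet("/path/to/data")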

Cheers,
immerrr




pyspark: dataframe.take is slow

2016-07-05 Thread immerrr again
Hi all!

I'm having a strange issue with PySpark 1.6.1. I have a DataFrame,

df = sqlContext.read.parquet('/path/to/data')

whose "df.take(10)" is really slow, apparently scanning the whole
dataset to take the first ten rows. "df.first()" works fast, as does
"df.rdd.take(10)".

I have found https://issues.apache.org/jira/browse/SPARK-10731, which
should have fixed this in 1.6.0, but apparently it has not. What am I
doing wrong here and how can I fix this?
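
If it really is the full scan behind DataFrame.take that hurts, an
obvious stopgap that builds on the fast "df.rdd.take(10)" above is to
pull the rows through the RDD API and rebuild a small DataFrame from
them when one is needed downstream; a rough sketch:

# take the first rows via the RDD API, which returns quickly here
rows = df.rdd.take(10)

# rebuild a small DataFrame with the original schema for further use
small_df = sqlContext.createDataFrame(rows, df.schema)
small_df.show()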

Cheers,
immerrr
