Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Davies Liu
This is a bug; I have created an issue to track it:
https://issues.apache.org/jira/browse/SPARK-3500

Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369

Until the next bugfix release, you can work around it like this:

srdd = sqlCtx.jsonRDD(rdd)
srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)
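
(The extra None stands in for the implicit Ordering parameter that the
Scala-side coalesce takes; Py4J cannot fill in implicit arguments, so all
three have to be passed explicitly. That is also why the plain
srdd.coalesce(N) call does not match any method on the Java side.)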


On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:
 Hi All,

 I'm having some trouble with the coalesce and repartition functions for
 SchemaRDD objects in pyspark.  When I run:

 sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
 '{"foo":"baz"}'])).coalesce(1)

 I get this error:

 Py4JError: An error occurred while calling o94.coalesce. Trace:
 py4j.Py4JException: Method coalesce([class java.lang.Integer, class
 java.lang.Boolean]) does not exist

 For context, I have a dataset stored in a parquet file, and I'm using
 SQLContext to make several queries against the data.  I then register the
 results of these queries as new tables in the SQLContext.  Unfortunately
 each new table has the same number of partitions as the original (despite
 being much smaller).  Hence my interest in coalesce and repartition.

 Has anybody else encountered this bug?  Is there an alternate workflow I
 should consider?

 I am running the 1.1.0 binaries released today.

 best,
 -Brad




Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Brad Miller
Hi Davies,

Thanks for the quick fix. I'm sorry to send out a bug report on release day
- 1.1.0 really is a great release.  I've been running the 1.1 branch for a
while and there's definitely lots of good stuff.

For the workaround, I think you may have meant:

srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Note:
_schema_rdd -> _jschema_rdd
false -> False

That workaround seems to work fine (in that I've observed the correct
number of partitions in the web UI, although I haven't tested it beyond
that).
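
In case it's useful to anyone else, here is the full sequence I ran in the
pyspark shell (a minimal sketch against the 1.1.0 binaries, assuming sc is
already defined; the glom() count is just a quick way to double-check the
partition count without opening the web UI):

from pyspark.sql import SQLContext, SchemaRDD

sqlCtx = SQLContext(sc)

# Start with more partitions than we want (4 here).
srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'], 4))

# SPARK-3500 workaround: coalesce the wrapped Java SchemaRDD directly,
# passing shuffle=False plus None for the implicit Ordering argument.
srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(1, False, None), sqlCtx)

# glom() turns each partition into a list, so the length of the collected
# result is the number of partitions.
print len(srdd2.glom().collect())  # prints 1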

Thanks!
-Brad

On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu dav...@databricks.com wrote:

 This is a bug; I have created an issue to track it:
 https://issues.apache.org/jira/browse/SPARK-3500

 Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369

 Until the next bugfix release, you can work around it like this:

 srdd = sqlCtx.jsonRDD(rdd)
 srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)


 On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu
 wrote:
  Hi All,
 
  I'm having some trouble with the coalesce and repartition functions for
  SchemaRDD objects in pyspark.  When I run:
 
  sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
  '{"foo":"baz"}'])).coalesce(1)
 
  I get this error:
 
  Py4JError: An error occurred while calling o94.coalesce. Trace:
  py4j.Py4JException: Method coalesce([class java.lang.Integer, class
  java.lang.Boolean]) does not exist
 
  For context, I have a dataset stored in a parquet file, and I'm using
  SQLContext to make several queries against the data.  I then register the
  results of these queries as new tables in the SQLContext.  Unfortunately
  each new table has the same number of partitions as the original (despite
  being much smaller).  Hence my interest in coalesce and repartition.
 
  Has anybody else encountered this bug?  Is there an alternate workflow I
  should consider?
 
  I am running the 1.1.0 binaries released today.
 
  best,
  -Brad



Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Davies Liu
On Fri, Sep 12, 2014 at 8:55 AM, Brad Miller bmill...@eecs.berkeley.edu wrote:
 Hi Davies,

 Thanks for the quick fix. I'm sorry to send out a bug report on release day
 - 1.1.0 really is a great release.  I've been running the 1.1 branch for a
 while and there's definitely lots of good stuff.

 For the workaround, I think you may have meant:

 srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Yes, thanks for the correction.
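
Once the fix is in, coalesce and repartition should work on the SchemaRDD
directly again, with no need to reach into _jschema_rdd:

srdd2 = sqlCtx.jsonRDD(rdd).coalesce(N)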

 Note:
 _schema_rdd -> _jschema_rdd
 false -> False

 That workaround seems to work fine (in that I've observed the correct number
 of partitions in the web UI, although I haven't tested it beyond that).

 Thanks!
 -Brad

 On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu dav...@databricks.com wrote:

 This is a bug; I have created an issue to track it:
 https://issues.apache.org/jira/browse/SPARK-3500

 Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369

 Until the next bugfix release, you can work around it like this:

 srdd = sqlCtx.jsonRDD(rdd)
 srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)


 On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu
 wrote:
  Hi All,
 
  I'm having some trouble with the coalesce and repartition functions for
  SchemaRDD objects in pyspark.  When I run:
 
  sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
  '{"foo":"baz"}'])).coalesce(1)
 
  I get this error:
 
  Py4JError: An error occurred while calling o94.coalesce. Trace:
  py4j.Py4JException: Method coalesce([class java.lang.Integer, class
  java.lang.Boolean]) does not exist
 
  For context, I have a dataset stored in a parquet file, and I'm using
  SQLContext to make several queries against the data.  I then register
  the
  results of these queries as new tables in the SQLContext.  Unfortunately
  each new table has the same number of partitions as the original
  (despite
  being much smaller).  Hence my interest in coalesce and repartition.
 
  Has anybody else encountered this bug?  Is there an alternate workflow I
  should consider?
 
  I am running the 1.1.0 binaries released today.
 
  best,
  -Brad


