This is a bug; I've created an issue to track it:
https://issues.apache.org/jira/browse/SPARK-3500

Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369

Until the next bugfix release, you can work around it by coalescing the
underlying Java SchemaRDD and re-wrapping the result (note Python's False,
not false; SchemaRDD is imported from pyspark.sql, and N is the target
number of partitions):

srdd = sqlCtx.jsonRDD(rdd)
srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)
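
For example, here is a self-contained sketch against the 1.1.0 API (the app
name is a placeholder, and the trailing None fills the Scala method's
implicit ordering parameter):

from pyspark import SparkContext
from pyspark.sql import SQLContext, SchemaRDD

sc = SparkContext(appName="coalesce-workaround")
sqlCtx = SQLContext(sc)

srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']))

# Bypass the broken Python wrapper: coalesce the wrapped Java SchemaRDD,
# then re-wrap the result as a Python SchemaRDD.
srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(1, False, None), sqlCtx)
print srdd2.collect()  # -> [Row(foo=u'bar'), Row(foo=u'baz')], now in one partition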


On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
> Hi All,
>
> I'm having some trouble with the coalesce and repartition functions for
> SchemaRDD objects in pyspark.  When I run:
>
> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])).coalesce(1)
>
> I get this error:
>
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class
> java.lang.Boolean]) does not exist
>
> For context, I have a dataset stored in a Parquet file, and I'm using
> SQLContext to make several queries against the data.  I then register the
> results of these queries as new tables in the SQLContext.  Unfortunately,
> each new table has the same number of partitions as the original (despite
> being much smaller).  Hence my interest in coalesce and repartition; a
> rough sketch of the workflow is below.
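>
> Roughly (the file path, table names, and query are made up):
>
> data = sqlCtx.parquetFile("hdfs:///path/to/data.parquet")
> data.registerTempTable("data")
> subset = sqlCtx.sql("SELECT foo FROM data WHERE foo IS NOT NULL")
> subset.registerTempTable("subset")  # same partition count as "data"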
>
> Has anybody else encountered this bug?  Is there an alternate workflow I
> should consider?
>
> I am running the 1.1.0 binaries released today.
>
> best,
> -Brad
