Hi All, I'm having some trouble with the coalesce and repartition functions for SchemaRDD objects in pyspark. When I run:
    sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])).coalesce(1)

I get this error:

    Py4JError: An error occurred while calling o94.coalesce. Trace:
    py4j.Py4JException: Method coalesce([class java.lang.Integer, class java.lang.Boolean]) does not exist

For context, I have a dataset stored in a Parquet file, and I'm using SQLContext to run several queries against the data. I then register the results of these queries as new tables in the SQLContext. Unfortunately, each new table has the same number of partitions as the original (despite being much smaller), hence my interest in coalesce and repartition.

Has anybody else encountered this bug? Is there an alternate workflow I should consider? I am running the 1.1.0 binaries released today.

best,
-Brad