Re: coalesce on SchemaRDD in pyspark
This is a bug; I've created an issue to track it:
https://issues.apache.org/jira/browse/SPARK-3500

There is also a PR to fix it:
https://github.com/apache/spark/pull/2369

Until the next bugfix release, you can work around it like this:

    srdd = sqlCtx.jsonRDD(rdd)
    srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)

On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
> Hi All,
>
> I'm having some trouble with the coalesce and repartition functions for
> SchemaRDD objects in pyspark. When I run:
>
>     sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])).coalesce(1)
>
> I get this error:
>
>     Py4JError: An error occurred while calling o94.coalesce. Trace:
>     py4j.Py4JException: Method coalesce([class java.lang.Integer,
>     class java.lang.Boolean]) does not exist
>
> For context, I have a dataset stored in a parquet file, and I'm using
> SQLContext to make several queries against the data. I then register the
> results of these queries as new tables in the SQLContext. Unfortunately,
> each new table has the same number of partitions as the original (despite
> being much smaller). Hence my interest in coalesce and repartition.
>
> Has anybody else encountered this bug? Is there an alternate workflow I
> should consider?
>
> I am running the 1.1.0 binaries released today.
>
> best,
> -Brad

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
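[Editor's note] For readers following along without a cluster, the shape of this failure can be sketched in plain Python. The PySpark wrapper forwards `coalesce(numPartitions, shuffle)` as a two-argument call to the JVM object, but the underlying method takes an additional ordering parameter, so Py4J finds no matching two-argument signature; the workaround passes that third argument (`None`) explicitly. The classes below are hypothetical stand-ins for illustration, not the real Spark API:

```python
class JSchemaRDDStub:
    """Hypothetical stand-in for the JVM-side SchemaRDD: its coalesce
    takes three arguments (numPartitions, shuffle, ordering)."""

    def __init__(self, partitions):
        self.partitions = partitions

    def coalesce(self, num_partitions, shuffle, ordering):
        # Coalescing without shuffle can only reduce the partition count.
        return JSchemaRDDStub(min(num_partitions, self.partitions))


def broken_wrapper_coalesce(jrdd, n):
    """What the 1.1.0 Python wrapper effectively did: a two-argument call
    where the underlying method expects three, so the call fails."""
    try:
        return jrdd.coalesce(n, False)
    except TypeError as exc:
        return exc


def workaround_coalesce(jrdd, n):
    """The workaround from this thread: supply the third argument as None."""
    return jrdd.coalesce(n, False, None)
```

In real PySpark the mismatch surfaces as a `Py4JException` rather than a `TypeError`, but the cause is the same: no method with that argument list exists on the JVM side.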
Re: coalesce on SchemaRDD in pyspark
Hi Davies,

Thanks for the quick fix. I'm sorry to send out a bug report on release day
- 1.1.0 really is a great release. I've been running the 1.1 branch for a
while and there's definitely lots of good stuff.

For the workaround, I think you may have meant:

    srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Note:
    _schema_rdd -> _jschema_rdd
    false -> False

That workaround seems to work fine (in that I've observed the correct
number of partitions in the web UI, although I haven't tested it beyond
that).

Thanks!
-Brad

On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu <dav...@databricks.com> wrote:
> This is a bug; I've created an issue to track it:
> https://issues.apache.org/jira/browse/SPARK-3500
>
> There is also a PR to fix it:
> https://github.com/apache/spark/pull/2369
>
> Until the next bugfix release, you can work around it like this:
>
>     srdd = sqlCtx.jsonRDD(rdd)
>     srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)
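[Editor's note] The two corrections matter for different reasons, and both fail inside Python itself before any JVM call is made: `_schema_rdd` is not an attribute of the Python `SchemaRDD` wrapper (the Py4J-wrapped JVM object lives in `_jschema_rdd`), so the lookup raises an `AttributeError`; and lowercase `false` is not a Python literal at all, so evaluating it raises a `NameError`. A plain-Python illustration, using a hypothetical stub in place of the real class:

```python
# Python's boolean literal is capitalized; the lowercase Scala/Java
# spelling is simply an undefined name in Python.
try:
    flag = false  # noqa: F821 - deliberate error
except NameError:
    flag = False


class SchemaRDDStub:
    """Hypothetical stub mirroring the attribute layout described above:
    the wrapped JVM object is stored under _jschema_rdd."""

    def __init__(self):
        self._jschema_rdd = object()


srdd = SchemaRDDStub()
has_right_attr = hasattr(srdd, "_jschema_rdd")   # True
has_wrong_attr = hasattr(srdd, "_schema_rdd")    # False: typo fails at lookup
```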
Re: coalesce on SchemaRDD in pyspark
On Fri, Sep 12, 2014 at 8:55 AM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
> Hi Davies,
>
> Thanks for the quick fix. I'm sorry to send out a bug report on release
> day - 1.1.0 really is a great release. I've been running the 1.1 branch
> for a while and there's definitely lots of good stuff.
>
> For the workaround, I think you may have meant:
>
>     srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Yes, thanks for the correction.

> Note:
>     _schema_rdd -> _jschema_rdd
>     false -> False
>
> That workaround seems to work fine (in that I've observed the correct
> number of partitions in the web UI, although I haven't tested it beyond
> that).
>
> Thanks!
> -Brad