Re: The differentce between SparkSql/DataFram join and Rdd join

Michael Armbrust Tue, 07 Apr 2015 11:01:55 -0700

The joins here are totally different implementations, but it is worrisome
that you are seeing the SQL join hanging.  Can you provide more information
about the hang?  jstack of the driver and a worker that is processing a
task would be very useful.


On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren <inv...@gmail.com> wrote:

> Hi,
>
> We have 2 hive tables and want to join one with the other.
>
> Initially, we ran a sql request on HiveContext. But it did not work. It was
> blocked on 30/600 tasks.
> Then we tried to load tables into two DataFrames, we have encountered the
> same problem.
> Finally, it works with RDD.join. What we have done is basically
> transforming
> 2 tables into 2 pair RDDs, then calling a join operation. It works great in
> about 500 s.
>
> However, workaround is just a workaround, since we have to transform hive
> tables into RDD. This is really annoying.
>
> Just wondering whether the underlying code of DF/SQL's join operation is
> the
> same as rdd's, knowing that there is a syntax analysis layer for DF/SQL,
> while RDD's join is straightforward on two pair RDDs.
>
> SQL request:
> ----------------------------------------------------------------------
> select v1.receipt_id, v1.sku, v1.amount, v1.qty, v2.discount
> from table1 as v1 left join table2 as v2
> on v1.receipt_id = v2.receipt_id
> where v1.sku != ""
>
> DataFrame:
>
> -----------------------------------------------------------------------------------------
> val rdd1 = ss.hiveContext.table(table1)
> val rdd1Filt = rdd1.filter(rdd1.col("sku") !== "")
> val rdd2 = ss.hiveContext.table(table2)
> val rddJoin = rdd1Filt.join(rdd2, rdd1Filt("receipt_id") ===
> rdd2("receipt_id"))
> rddJoin.saveAsTable("testJoinTable", SaveMode.Overwrite)
>
> RDD workaround in this case is a bit cumbersome, for short, we just created
> 2 RDDs, join, and then apply a new schema on the result RDD. This approach
> works, at least all tasks were finished, while the DF/SQL approach don't.
>
> Any idea ?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: The differentce between SparkSql/DataFram join and Rdd join

Reply via email to