Unfortunately, I don't think there is any optimized way to do this. Maybe 
someone else can correct me, but in theory, there is no way other than a 
cartesian product of your 2 sides if you cannot change the data.


Think about it: if you want to join between 2 different types (Array and Int in 
your case), Spark cannot do a HashJoin, nor a SortMergeJoin. In the relational-DB 
world, you have to do a NestedLoop join, which is a cartesian join in the big-data 
world. After you create the cartesian product of both sides, you then check 
whether the Array column contains the other column.
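Concretely, that two-step recipe might be sketched like this. This is only a sketch: it assumes DataFrames shaped like the df1/df2 in the transcript further down, and in Spark 2.x a join with no condition may need spark.sql.crossJoin.enabled=true:

```scala
import org.apache.spark.sql.functions.array_contains

// Step 1: cartesian product (join with no condition).
val crossed = df1.join(df2)
// Step 2: keep only the rows where the Array column contains the Int key.
// array_contains (available since Spark 1.5) compares using the element
// type, so no string-cast trick is involved.
val matched = crossed.filter(array_contains(df2("keys"), df1("key")))
```

Spark still has to evaluate this as a cartesian product plus a filter, because the predicate is not an equi-join condition, but at least the comparison itself is type-safe.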


I saw a similar question answered on Stack 
Overflow<http://stackoverflow.com/questions/41595099/spark-join-dataframe-column-with-an-array>,
 but the answer is NOT correct for any serious case.

It assumes the stringified keys never collide as substrings of one another:


scala> sc.version
res1: String = 1.6.3

scala> val df1 = Seq((1, "a"), (3, "a"), (5, "b")).toDF("key", "value")
df1: org.apache.spark.sql.DataFrame = [key: int, value: string]

scala> val df2 = Seq((Array(1,2,3), "other1"),(Array(4,5), 
"other2")).toDF("keys", "other")
df2: org.apache.spark.sql.DataFrame = [keys: array<int>, other: string]

scala> df1.show
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  3|    a|
|  5|    b|
+---+-----+


scala> df2.show
+---------+------+
|     keys| other|
+---------+------+
|[1, 2, 3]|other1|
|   [4, 5]|other2|
+---------+------+


scala> df1.join(df2, 
df2("keys").cast("string").contains(df1("key").cast("string"))).show
+---+-----+---------+------+
|key|value|     keys| other|
+---+-----+---------+------+
|  1|    a|[1, 2, 3]|other1|
|  3|    a|[1, 2, 3]|other1|
|  5|    b|   [4, 5]|other2|
+---+-----+---------+------+

This code appears to work in Spark 1.6.x, but in fact it has a serious issue: 
it assumes the key will never collide with another key as a substring, which 
won't be true for any serious data.
For example, the string form of [11, 2, 3] contains "1", so the join would 
match key 1 even though 1 is not an element, which breaks your logic.
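A tiny sketch of that collision, with hypothetical data and the same cast trick, in Spark 1.6.x where the cast is allowed:

```scala
// The string rendering of Array(11, 2, 3) contains the substring "1",
// so the cast-and-contains check wrongly matches key 1.
val bad = Seq((Array(11, 2, 3), "other1")).toDF("keys", "other")
bad.select(bad("keys").cast("string").contains("1")).show()  // reports true
```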

And it won't work in Spark 2.x, because the cast("string") behavior for Array 
changes in Spark 2.x. The idea behind the whole thing is to cast your Array 
field to String and use the contains method to check whether it contains the 
other field (also cast to String). But that cannot match the semantics of 
Array.contains(element) in most cases.

You need to know your data, then see whether you can find an optimized way 
to avoid the cartesian product. For example, if you can guarantee that "key" in 
DF1 is always the first element of the Array in some logical order, then you 
can just pick the first element out of the Array "keys" of DF2 to join on. 
Otherwise, I don't see any way to avoid a cartesian join.
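For the data in this thread specifically (keys unique across lists, and every list containing at least one key from DF1), one way to avoid the cartesian product is to tag each list, explode it, equi-join on the now-scalar key, and propagate the value over the list id with a window. Treat this as a sketch, not a tested recipe; the function names below are the Spark 2.x ones:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{explode, max, monotonically_increasing_id}

// Tag each list so its members can be regrouped after the explode.
val tagged = df2.withColumn("id", monotonically_increasing_id())
// One row per (id, key): "key" is now a scalar, so an equi-join works.
val exploded = tagged.select($"id", explode($"keys").as("key"))
// Left join keeps the keys (2 and 4) that have no value in df1 yet.
val joined = exploded.join(df1, Seq("key"), "left_outer")
// max ignores nulls, so every member of a list inherits the one value
// that some member of that list got from df1.
val result = joined
  .withColumn("value", max($"value").over(Window.partitionBy($"id")))
  .select($"key", $"value")
```

This replaces the cartesian join with an equi-join plus a shuffle per list id, which is usually far cheaper when the lists are small.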

Yong

________________________________
From: Mungeol Heo <mungeol....@gmail.com>
Sent: Thursday, March 30, 2017 3:05 AM
To: ayan guha
Cc: Yong Zhang; user@spark.apache.org
Subject: Re: Need help for RDD/DF transformation.

Hello ayan,

The same key will not exist in different lists.
Which means, if "1" exists in one list, it will not be present in
another list.

Thank you.

On Thu, Mar 30, 2017 at 3:56 PM, ayan guha <guha.a...@gmail.com> wrote:
> Is it possible for one key to be in 2 groups in rdd2?
>
> [1,2,3]
> [1,4,5]
>
> ?
>
> On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo <mungeol....@gmail.com> wrote:
>>
>> Hello Yong,
>>
>> First of all, thank you for your attention.
>> Note that the elements which appear in RDD/DF 1 and belong to the same
>> list will always have the same value.
>> Therefore, "1" and "3", which come from RDD/DF 1, will always have the
>> same value, which is "a".
>>
>> The goal here is to assign that same value to the elements of the list
>> which do not exist in RDD/DF 1,
>> so that all the elements in the same list have the same value.
>>
>> Or, the final RDD/DF also can be like this,
>>
>> [1, 2, 3], a
>> [4, 5], b
>>
>> Thank you again.
>>
>> - Mungeol
>>
>>
>> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang <java8...@hotmail.com> wrote:
>> > What is the desired result for
>> >
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, c
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> >
>> > Yong
>> >
>> > ________________________________
>> > From: Mungeol Heo <mungeol....@gmail.com>
>> > Sent: Wednesday, March 29, 2017 5:37 AM
>> > To: user@spark.apache.org
>> > Subject: Need help for RDD/DF transformation.
>> >
>> > Hello,
>> >
>> > Suppose I have two RDDs or data frames like those shown below.
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, a
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> > I need to create a new RDD/DF like below from RDD/DF 1 and 2.
>> >
>> > 1, a
>> > 2, a
>> > 3, a
>> > 4, b
>> > 5, b
>> >
>> > Is there an efficient way to do this?
>> > Any help will be great.
>> >
>> > Thank you.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>>
> --
> Best Regards,
> Ayan Guha
