Joining by values
I have two pair RDDs in Spark like this:

rdd1 =
(1 -> [4,5,6,7])
(2 -> [4,5])
(3 -> [6,7])

rdd2 =
(4 -> [1001,1000,1002,1003])
(5 -> [1004,1001,1006,1007])
(6 -> [1007,1009,1005,1008])
(7 -> [1011,1012,1013,1010])

I would like to combine them to look like this:

joinedRdd =
(1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
(2 -> [1000,1001,1002,1003,1004,1006,1007])
(3 -> [1005,1007,1008,1009,1010,1011,1012,1013])

Can someone suggest how to do this?

Thanks
Dilip
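For reference, the two pair RDDs above can be built in memory like this (a minimal sketch; it assumes an existing SparkContext named sc, whereas the replies below read the same data from text files):

val rdd1 = sc.parallelize(Seq(
  (1, List(4, 5, 6, 7)),
  (2, List(4, 5)),
  (3, List(6, 7))
))
val rdd2 = sc.parallelize(Seq(
  (4, List(1001, 1000, 1002, 1003)),
  (5, List(1004, 1001, 1006, 1007)),
  (6, List(1007, 1009, 1005, 1008)),
  (7, List(1011, 1012, 1013, 1010))
))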
Re: Joining by values
This is my design. Now let me try and code it in Spark.

rdd1.txt =
1~4,5,6,7
2~4,5
3~6,7

rdd2.txt =
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010

TRANSFORM 1
===
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3

TRANSFORM 2
===
Join keys in transform 1 and rdd2
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010

TRANSFORM 3
===
Split key in transform 2 with ~ and keep key(1), i.e. 1,2,3
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010

TRANSFORM 4
===
join by key
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010
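TRANSFORM 4 is labelled "join by key", but since the keys are already 1, 2, 3 after TRANSFORM 3, it is really a grouping/aggregation step. A minimal sketch of expressing it with reduceByKey (transform3 is a hypothetical name for the RDD[(String, String)] produced by TRANSFORM 3):

// concatenate the comma-joined value strings per key (transform3 is a
// hypothetical name; substitute the actual RDD from TRANSFORM 3)
val transform4 = transform3.reduceByKey((a, b) => a + "," + b)

reduceByKey combines values map-side before the shuffle, which is usually cheaper than groupByKey when there are many values per key.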
Re: Joining by values
hi

Take a look at the code I wrote here:
https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala

/* rdd1.txt
1~4,5,6,7
2~4,5
3~6,7

rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
*/
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations

val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
val sc = new SparkContext(sconf)
val rdd1 = "/path/to/rdd1.txt"
val rdd2 = "/path/to/rdd2.txt"

// invert rdd1: one (value, key) pair per value, e.g. (4,1), (5,1), ...
val rdd1InvIndex = sc.textFile(rdd1)
  .map(x => (x.split('~')(0), x.split('~')(1)))
  .flatMapValues(str => str.split(','))
  .map(str => (str._2, str._1))

// rdd2 as (key, comma-joined values) pairs
val rdd2Pair = sc.textFile(rdd2)
  .map(str => (str.split('~')(0), str.split('~')(1)))

// join on rdd2's keys, keep (rdd1Key, values), then group by rdd1Key
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)

This outputs the following. I think this may be essentially what you are looking for (I have to understand how to NOT print as CompactBuffer):

(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
Re: Joining by values
call `map(_.toList)` to convert `CompactBuffer` to `List`

Best Regards,
Shixiong Zhu
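A slightly tidier variant of the same fix (a sketch; mapValues keeps the key in place instead of rebuilding the tuple):

rdd1InvIndex.join(rdd2Pair)
  .map(_._2)
  .groupByKey()
  .mapValues(_.toList)  // convert each CompactBuffer to a List for printing
  .collect()
  .foreach(println)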
Re: Joining by values
so I changed the code to:

rdd1InvIndex.join(rdd2Pair)
  .map(str => str._2)
  .groupByKey()
  .map(str => (str._1, str._2.toList))
  .collect()
  .foreach(println)

Now it prints. Don't worry, I will work on this to not output as List(...). But I am hoping that the JOIN question that @Dilip asked is answered :-)

(2,List(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,List(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,List(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
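Note that this output still differs from the joinedRdd in the original question: the values are nested comma-joined strings, contain duplicates (e.g. 1001 and 1007 under key 1), and are unsorted. A sketch of one way to finish the job, reusing the rdd1InvIndex and rdd2Pair definitions from the earlier post:

val joinedRdd = rdd1InvIndex.join(rdd2Pair)
  .map(_._2)                    // (rdd1Key, "1001,1000,1002,1003")
  .flatMapValues(_.split(','))  // one (rdd1Key, id) pair per id
  .distinct()                   // drop duplicate (key, id) pairs
  .groupByKey()
  .mapValues(_.map(_.toInt).toList.sorted)

joinedRdd.collect().foreach(println)

This should print each key with a distinct, sorted list of ids, e.g. (2,List(1000, 1001, 1002, 1003, 1004, 1006, 1007)).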
Re: Joining by values
Thanks Sanjay. I will give it a try.

Thanks
Dilip