Re: can't union two rdds
An RDD union will result in 1 2 3 4 5 6 7 8 9 10 11 12. What you are trying to do is a join. There must be some logic or key on which to perform the join operation. I think in your case you want the order (index) to be the joining key. An RDD is a distributed data structure with no built-in positional index, so it is not a natural fit for your case. If the amount of data is small, you can call rdd.collect on both RDDs, iterate over the two resulting lists together, and produce the desired result.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22323.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
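The "index as joining key" idea in the reply above can be sketched roughly as follows. This is only an illustration under stated assumptions: the element type (Int), the sample values, the local master, and the names JoinByIndex and joinByIndex are hypothetical and do not come from the thread. It uses zipWithIndex to attach a position to each element and then joins on that position.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3 the pair-RDD implicits also need:
// import org.apache.spark.SparkContext._

object JoinByIndex {
  // Hypothetical helper: pair the i-th element of `a` with the i-th element of `b`.
  def joinByIndex(a: RDD[Int], b: RDD[Int]): RDD[(Int, Int)] = {
    // zipWithIndex yields (value, index); swap so the index becomes the join key.
    val aByIdx = a.zipWithIndex().map { case (v, i) => (i, v) }
    val bByIdx = b.zipWithIndex().map { case (v, i) => (i, v) }
    // join gives (index, (aValue, bValue)); sort by index, then drop the key.
    aByIdx.join(bByIdx).sortByKey().values
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf().setAppName("join-by-index").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Illustrative data; the partition counts differ on purpose.
      val a = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 2)
      val b = sc.parallelize(Seq(7, 8, 9, 10, 11, 12), 3)
      joinByIndex(a, b).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Unlike a positional zip, this approach does not require the two RDDs to be partitioned identically, at the cost of a shuffle for the join. Elements whose index has no counterpart in the other RDD are dropped, since join is an inner join.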
Re: can't union two rdds
Use zip.
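A minimal sketch of the zip suggestion, assuming the goal is to pair the i-th element of one RDD with the i-th element of the other. The sample values, the local master, and the names ZipExample and pairUp are illustrative assumptions, not from the thread. RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, so both inputs here are built with the same partition count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipExample {
  // Hypothetical helper: build two equally partitioned RDDs and zip them.
  def pairUp(sc: SparkContext): Array[(Int, Int)] = {
    // Same length and same number of slices, so the partitions line up.
    val a = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 2)
    val b = sc.parallelize(Seq(7, 8, 9, 10, 11, 12), 2)
    // zip pairs elements positionally, partition by partition.
    a.zip(b).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf().setAppName("zip-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      pairUp(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If the two RDDs come from different sources and their partitioning may differ, zip is likely to fail at runtime; in that situation, joining on an explicit index is the safer choice.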
Re: UNION two RDDs
Hi Sean and Madhu,

Thank you for the explanation. I really appreciate it.

Best Regards,
Jerry

On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen so...@cloudera.com wrote:
> coalesce actually changes the number of partitions. Unless the original RDD had just 1 partition, coalesce(1) will make an RDD with 1 partition that is larger than the original partitions, of course. I don't think the question is about ordering of things within an element of the RDD?
>
> If the original RDD was sorted, and so has a defined ordering, then it will be preserved. Otherwise I believe you do not have any guarantees about ordering. In practice, you may find that you still encounter the elements in the same order after coalesce(1), although I am not sure that is even true.
>
> union() is the same story; unless the RDDs are sorted I don't think there are guarantees. However I'm almost certain that in practice, as it happens now, A's elements would come before B's after a union, if you did traverse them.
>
> On Fri, Dec 19, 2014 at 5:41 AM, madhu phatak phatak@gmail.com wrote:
>> Hi,
>> coalesce is an operation which changes no of records in a partition. It will not touch ordering with in a row AFAIK.
>>
>> On Fri, Dec 19, 2014 at 2:22 AM, Jerry Lam chiling...@gmail.com wrote:
>>> Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry
>>
>> --
>> Regards, Madhukara Phatak
>> http://www.madhukaraphatak.com
Re: UNION two RDDs
coalesce actually changes the number of partitions. Unless the original RDD had just 1 partition, coalesce(1) will make an RDD with 1 partition that is larger than the original partitions, of course. I don't think the question is about the ordering of things within an element of the RDD?

If the original RDD was sorted, and so has a defined ordering, then it will be preserved. Otherwise I believe you do not have any guarantees about ordering. In practice, you may find that you still encounter the elements in the same order after coalesce(1), although I am not sure that is even true.

union() is the same story; unless the RDDs are sorted I don't think there are guarantees. However, I'm almost certain that in practice, as it happens now, A's elements would come before B's after a union, if you did traverse them.

On Fri, Dec 19, 2014 at 5:41 AM, madhu phatak phatak@gmail.com wrote:
> Hi,
> coalesce is an operation which changes the number of records in a partition. It will not touch ordering within a row, AFAIK.
>
> On Fri, Dec 19, 2014 at 2:22 AM, Jerry Lam chiling...@gmail.com wrote:
>> Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry
>
> --
> Regards, Madhukara Phatak
> http://www.madhukaraphatak.com
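Since the discussion above concludes that the A-before-B order after union() and coalesce(1) is observed behavior rather than a stated guarantee, one option is to make the order explicit instead of relying on it. The sketch below is an assumption-laden illustration, not something proposed in the thread: it tags each record with its source (0 for A, 1 for B) and its position, then sorts on that composite key into a single partition. The names OrderedUnion and orderedUnion, the String element type, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3 the pair-RDD implicits also need:
// import org.apache.spark.SparkContext._

object OrderedUnion {
  // Hypothetical helper: union `a` and `b`, then sort on (source tag, position)
  // so that every record of `a` precedes every record of `b` in one partition.
  def orderedUnion(a: RDD[String], b: RDD[String]): RDD[String] = {
    val taggedA = a.zipWithIndex().map { case (v, i) => ((0, i), v) }
    val taggedB = b.zipWithIndex().map { case (v, i) => ((1, i), v) }
    // Sorting into one partition takes the place of a separate coalesce(1).
    taggedA.union(taggedB).sortByKey(ascending = true, numPartitions = 1).values
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf().setAppName("ordered-union").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rddA = sc.parallelize(Seq("a1", "a2", "a3"), 2)
      val rddB = sc.parallelize(Seq("b1", "b2"), 2)
      orderedUnion(rddA, rddB).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The positions assigned by zipWithIndex follow each RDD's existing partition order, so the order within A and within B is only as meaningful as the order the inputs already had. The sort adds a shuffle, which is the price of not depending on unspecified behavior.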
UNION two RDDs
Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry