Hi Dave,
I had the same question and was wondering if you had found a way to do the
join without causing a shuffle?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Streaming-and-partitioning-tp25955p28425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
t;>> Spark will do a shuffle under the hood in this case and the join will
>>> take
>>> place. The join will do its best to run on a node that has local access
>>> to
>>> the reference data RDD.
>>>
>>> Is there any difference between
;
>>>> I have two ways to do this.
>>>> 1. Explicitly call PartitionBy(CutomParitioner) on the input stream RDD
>>>> followed by a join. This results in a shuffle of the input stream RDD
>>>> and
>>>> then the co-partitioned join to t
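For anyone following along, the two approaches described above might look roughly like this in Spark's Scala API. This is a sketch, not code from the thread: `inputStream`, `referenceRdd`, and the body of `CustomPartitioner` are stand-in assumptions.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical partitioner standing in for the CustomPartitioner
// mentioned in the thread.
class CustomPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int =
    math.abs(key.hashCode) % partitions
}

def joinBothWays(
    inputStream: DStream[(String, String)],  // e.g. a Kafka direct stream
    referenceRdd: RDD[(String, String)]      // assumed already partitioned
): Unit = {
  val partitioner = new CustomPartitioner(8)

  // Approach 1: explicit partitionBy, then join. The partitionBy call
  // shuffles each batch RDD; the join itself is then co-partitioned
  // with referenceRdd.
  inputStream.foreachRDD { rdd =>
    rdd.partitionBy(partitioner).join(referenceRdd).count()
  }

  // Approach 2: join directly; Spark shuffles under the hood.
  inputStream.foreachRDD { rdd =>
    rdd.join(referenceRdd).count()
  }
}
```

In both cases the stream side is shuffled each batch; the difference is only whether the shuffle is requested explicitly or implied by the join.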
>> ...o an already created RDD and not to do a shuffle.
>> Spark in this case trusts that the data is set up correctly (as in the
>> use case above) and simply fills in the necessary metadata on the RDD
>> partitions, i.e. checks the first entry in each partition to determine
>> the partition number of the data.
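The behaviour described above (declaring a partitioner on existing data without moving it) has, as far as I know, no public API in Spark core, but it can be sketched as a small wrapper RDD. Everything here is hypothetical, not from the thread; the caller is responsible for the data actually being laid out as the partitioner claims.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, Partitioner, TaskContext}

// Hypothetical wrapper: exposes `part` as the RDD's partitioner while
// reusing the parent's partitions and data unchanged - no shuffle.
class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)(
    implicit ct: ClassTag[(K, V)]
) extends RDD[(K, V)](prev) {

  // Declaring the partitioner is what lets a later join skip the shuffle
  // on this side.
  override val partitioner: Option[Partitioner] = Some(part)

  // Same partitions, same data: Spark trusts the layout is correct.
  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}
```

A join such as `new AssumePartitionedRDD(rdd, partitioner).join(referenceRdd)` would then be treated as co-partitioned, with silently wrong results if the data does not actually match the declared partitioner.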
>> Thank you in advance for any help on this issue.
>> Dave.