Thanks.

Since the join will be done on a regular basis over a short interval (let's say
every 20s), do you have any suggestions on how to make it faster?

I am thinking of partitioning the data set and caching it.
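
For example, a minimal sketch of that idea (assuming the visits live in a CSV
keyed by userId; the path and partition count here are illustrative):

import org.apache.spark.HashPartitioner

val visits = sc.textFile("hdfs:///data/visits.csv") // hypothetical path
  .map(_.split(','))
  .map(f => (f(0), (f(1), f(2).toLong))) // (userId, (source, timestamp))
  .partitionBy(new HashPartitioner(100)) // shuffle once up front
  .cache()                               // reuse across micro-batches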

Rendy
On Apr 30, 2015 6:31 AM, "Tathagata Das" <t...@databricks.com> wrote:

> Have you taken a look at the join section in the streaming programming
> guide?
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins
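>
> The example there is essentially the following (a sketch; `dataset` and
> `stream` stand in for your historical RDD and input DStream):
>
> val dataset: RDD[(String, String)] = ... // the historical data
> val windowedStream = stream.window(Seconds(20))
> val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }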
>
> On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior <
> rendy.b.jun...@gmail.com> wrote:
>
>> Let's say I have transaction data and visit data.
>>
>> visit
>> | userId | Visit source | Timestamp |
>> | A      | google ads   | 1         |
>> | A      | facebook ads | 2         |
>>
>> transaction
>> | userId | total price | timestamp |
>> | A      | 100         | 248384    |
>> | B      | 200         | 43298739  |
>>
>> I want to join the transaction data and visit data to do sales attribution. I
>> want to do it in real time, whenever a transaction occurs (streaming).
>>
>> Is it scalable to do a join between one piece of data and very big historical
>> data using the join function in Spark? If it is not, how is it usually done?
>>
>> Visits need to be historical, since a visit can happen at any time before the
>> transaction (e.g. the visit occurs one year before the transaction).
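>>
>> Concretely, I mean something like this sketch (names are made up;
>> `visits` is the big historical pair RDD, `transactions` the DStream):
>>
>> // join each micro-batch of transactions against the historical visits
>> val attributed = transactions.transform { batchRdd =>
>>   batchRdd.join(visits) // yields (userId, (txnInfo, visitInfo))
>> }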
>>
>> Rendy
>>
>
>
