Thanks. Since the join will run on a regular basis at short intervals (let's say every 20 s), do you have any suggestions for making it faster?
I am thinking of partitioning the data set and caching it.

Rendy

On Apr 30, 2015 6:31 AM, "Tathagata Das" <t...@databricks.com> wrote:

> Have you taken a look at the join section in the streaming programming
> guide?
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins
>
> On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior <
> rendy.b.jun...@gmail.com> wrote:
>
>> Let's say I have transaction data and visit data:
>>
>> visit
>> | userId | Visit source | Timestamp |
>> | A      | google ads   | 1         |
>> | A      | facebook ads | 2         |
>>
>> transaction
>> | userId | total price | timestamp |
>> | A      | 100         | 248384    |
>> | B      | 200         | 43298739  |
>>
>> I want to join transaction data and visit data to do sales attribution. I
>> want to do it in real time, whenever a transaction occurs (streaming).
>>
>> Is it scalable to join a single record against very big historical data
>> using the join function in Spark? If it is not, how is this usually done?
>>
>> The visit data needs to be historical, since a visit can happen any time
>> before the transaction (e.g. a visit one year before the transaction occurs).
>>
>> Rendy
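
To make the partition-and-cache idea above concrete, here is a minimal plain-Python sketch of the logic. It is not Spark code and not a tested recommendation. It assumes last-touch attribution (the most recent visit at or before the transaction wins), which the thread does not specify. The names `NUM_PARTITIONS`, `partition_of`, `build_visit_cache` and `attribute_batch` are hypothetical. In Spark the analogous step would presumably be keying the visit data by userId, partitioning it once, and caching it, so that each 20 s batch only does a keyed lookup instead of reshuffling the history.

```python
from bisect import bisect_right
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen only for illustration


def partition_of(user_id, num_partitions=NUM_PARTITIONS):
    # Stable hash: the same userId always maps to the same partition.
    return sum(user_id.encode("utf-8")) % num_partitions


def build_visit_cache(visits, num_partitions=NUM_PARTITIONS):
    """Partition visits by userId once and keep them in memory.

    Each partition maps userId -> (sorted timestamps, matching sources).
    """
    grouped = [defaultdict(list) for _ in range(num_partitions)]
    for user_id, source, ts in visits:
        grouped[partition_of(user_id, num_partitions)][user_id].append((ts, source))

    cache = []
    for part in grouped:
        index = {}
        for user_id, user_visits in part.items():
            user_visits.sort()  # order by timestamp
            index[user_id] = (
                [ts for ts, _ in user_visits],
                [src for _, src in user_visits],
            )
        cache.append(index)
    return cache


def attribute_batch(transactions, cache):
    """Join one micro-batch of transactions against the cached visits.

    Assumption: last-touch attribution. Users with no earlier visit get None.
    """
    results = []
    for user_id, total_price, ts in transactions:
        part = cache[partition_of(user_id, len(cache))]
        timestamps, sources = part.get(user_id, ([], []))
        # Number of visits with timestamp <= transaction timestamp.
        idx = bisect_right(timestamps, ts)
        source = sources[idx - 1] if idx > 0 else None
        results.append((user_id, total_price, source))
    return results


if __name__ == "__main__":
    # Sample rows taken from the tables in the original question.
    visits = [("A", "google ads", 1), ("A", "facebook ads", 2)]
    transactions = [("A", 100, 248384), ("B", 200, 43298739)]
    cache = build_visit_cache(visits)  # built once, reused for every batch
    print(attribute_batch(transactions, cache))
```

The point of the sketch is that the expensive work (grouping and sorting the historical visits) happens once, while each batch costs one hash plus one binary search per transaction.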