You can use spark-sql for this use case, and you don't need 800G of memory
(unless you want to cache the whole data set in memory). If you don't want to
bring all the data into memory, persist it with the MEMORY_AND_DISK_SER
storage level; most of the data will then live on disk, and Spark will still
use it efficiently.
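
Something along these lines (a rough sketch meant for spark-shell, where sc is
already in scope, using the plain RDD API; the same staged join can also be
expressed with spark-sql over registered tables -- the paths, delimiter and
field positions below are just placeholders, not from your data):

    import org.apache.spark.storage.StorageLevel

    // Placeholder paths/layout: adjust to where T1 and T2 actually live.
    val t1 = sc.textFile("hdfs:///data/t1").map(_.split("\t"))
    val t2 = sc.textFile("hdfs:///data/t2").map(_.split("\t"))

    // Serialized, spilling to disk: partitions that don't fit in memory go to
    // local disk, so neither table is re-read from the source for every pass.
    t1.persist(StorageLevel.MEMORY_AND_DISK_SER)
    t2.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Pass 1: join on column A (field 0 in this sketch).
    val byA1 = t1.keyBy(_(0))
    val byA2 = t2.keyBy(_(0))
    val matchedOnA = byA1.join(byA2)

    // Pass 2: T1 rows with no match on A, re-joined on column B (field 1),
    // and so on for C...
    val unmatchedOnA = byA1.leftOuterJoin(byA2)
      .filter { case (_, (_, right)) => right.isEmpty }
      .map { case (_, (left, _)) => left }
    val matchedOnB = unmatchedOnA.keyBy(_(1)).join(t2.keyBy(_(1)))

Each later pass only touches the rows that missed in the earlier one, so you
avoid rescanning T1 the way the Hive query does.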

Thanks
Best Regards

On Fri, Oct 24, 2014 at 8:47 AM, jian.t <jian.tan...@gmail.com> wrote:

> Hello,
> I am new to Spark. I have a basic question about the memory requirement of
> using Spark.
>
> I need to join multiple large data sets, and the join is not a
> straightforward join. The logic is more like: first join T1 on column A
> with T2, then for all the records that couldn't find a match, join T1 on
> column B with T2, then on column C, and so on. I was using Hive, but it
> requires multiple scans of T1, which turns out to be slow.
>
> It seems like if I load T1 and T2 into memory using Spark, I could improve
> performance. However, T1 and T2 together total around 800G. Does that mean
> I need 800G of memory (which I don't have)? Or can Spark do something like
> streaming, and if so, will performance suffer as a result?
>
>
>
> Thanks
> JT
>
>
>
