Re: Hi all,
Hi Jean,

We prepare the data for all the other jobs. We have a lot of jobs scheduled at different times, but all of them need to read the same raw data.

On Fri, Nov 3, 2017 at 12:49 PM, Jean Georges Perrin wrote:
> Hi Oren,
>
> Why don’t you want to use a GroupBy? You can cache or checkpoint the
> result and use it in your process, keeping everything in Spark and avoiding
> save/ingestion...
Re: Hi all,
Hi Oren,

Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion...

> On Oct 31, 2017, at 08:17, אורן שמון <oren.sha...@gmail.com> wrote:
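A minimal sketch of this cache/checkpoint approach, assuming Spark 2.x's Scala API and a hypothetical preProcessed DataFrame with userId and event columns (all names and paths here are placeholders, not anyone's actual code):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, collect_list}

    val spark = SparkSession.builder.appName("process").getOrCreate()

    // Hypothetical pre-processed input.
    val preProcessed = spark.read.parquet("/data/raw")

    // Group once, then keep the result inside Spark so later steps reuse it
    // without re-reading or re-shuffling the data.
    val perUser = preProcessed
      .groupBy(col("userId"))
      .agg(collect_list(col("event")).as("events"))
      .cache()  // or .checkpoint(), after spark.sparkContext.setCheckpointDir(...)

    perUser.count()  // materializes the cache; subsequent actions hit memory

Note this only helps downstream steps that run inside the same Spark application; a cached DataFrame is not visible to separately scheduled jobs.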
Hi all,
I have 2 Spark jobs: one is the pre-process and the second is the process. The process job needs to calculate something for each user in the data. I want to avoid a shuffle like groupBy, so I am thinking about saving the result of the pre-process bucketed by user in Parquet, or re-partitioning by user and saving the result. Which is preferable, and why?

Thanks in advance,
Oren
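To make the two options concrete, a minimal sketch, assuming Spark 2.x's Scala API; the input path, output names, userId column, and bucket count are all illustrative placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder.appName("pre-process").getOrCreate()

    // Hypothetical raw input shared by all downstream jobs.
    val preProcessed = spark.read.parquet("/data/raw")

    // Option 1: bucket by user. bucketBy only works with saveAsTable, but the
    // bucketing is recorded in the catalog, so downstream jobs that
    // groupBy/join on userId can avoid the shuffle.
    preProcessed.write
      .bucketBy(200, "userId")  // 200 buckets is an arbitrary example
      .sortBy("userId")
      .format("parquet")
      .saveAsTable("pre_processed_by_user")

    // Option 2: repartition by user and write plain Parquet files. The data is
    // clustered per user on disk, but that layout is not recorded anywhere, so
    // a later groupBy("userId") in another job will still shuffle.
    preProcessed
      .repartition(col("userId"))
      .write
      .parquet("/data/pre_processed_by_user")

Since the output is consumed by several independently scheduled jobs, bucketing has the edge in this scenario: the shuffle cost is paid once at write time and can be skipped by every downstream reader, while plain repartitioned files carry no such guarantee.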
hi all
Hi, I just wanted to say hi to the Spark community. I'm developing some things with Spark right now (we started very recently). While Spark's API documentation is really good, I'd like to get deeper knowledge of the internals (you know, the goodies). Watching videos from the Spark Summits helps; nevertheless, I hope to learn a lot from reading this mailing list.

Regards,
Pawel Szulc