Hi Jean,

We prepare the data for all the other jobs. We have many jobs scheduled at different times, but all of them need to read the same raw data.

On Fri, Nov 3, 2017 at 12:49 PM Jean Georges Perrin <jper...@lumeris.com> wrote:

> Hi Oren,
>
> Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion...
>
> > On Oct 31, 2017, at 08:17, אורן שמון <oren.sha...@gmail.com> wrote:
> >
> > I have 2 Spark jobs: one is the pre-process and the second is the process.
> > The process job needs to calculate a result for each user in the data.
> > I want to avoid a shuffle such as groupBy, so I am thinking of either saving the result of the pre-process bucketed by user in Parquet, or repartitioning by user and saving the result.
> >
> > Which is preferable, and why?
> > Thanks in advance,
> > Oren
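
The bucketing idea discussed in this thread can be sketched in plain Python. This is not Spark code and not anyone's actual job. It only illustrates why writing rows bucketed by user lets a later job aggregate per user without regrouping the whole dataset. The names `bucket_for`, `write_bucketed`, `process_bucket` and the bucket count of 4 are hypothetical, and `zlib.crc32` stands in for Spark's own hash function, which is different.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Map a user id to a bucket deterministically (stable across runs)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Pre-process step: place every row in the bucket of its user."""
    buckets = defaultdict(list)
    for user_id, value in rows:
        buckets[bucket_for(user_id, num_buckets)].append((user_id, value))
    return dict(buckets)


def process_bucket(bucket_rows):
    """Process step: per-user totals computed from one bucket's rows only."""
    totals = defaultdict(int)
    for user_id, value in bucket_rows:
        totals[user_id] += value
    return dict(totals)


# Made-up sample rows: (user, value)
rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]

buckets = write_bucketed(rows)

# Each bucket is processed independently; because a user never spans two
# buckets, merging the per-bucket results needs no cross-bucket regrouping.
result = {}
for bucket_rows in buckets.values():
    result.update(process_bucket(bucket_rows))
```

Because every row for a given user lands in exactly one bucket, each downstream job that reads the same pre-processed data can work bucket by bucket. That is the property the question hopes to get from bucketing by user.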