Hi Jean,
We prepare the data for all the other jobs. We have a lot of jobs that are
scheduled at different times, but all of them need to read the same raw data.
On Fri, Nov 3, 2017 at 12:49 PM Jean Georges Perrin
wrote:
> Hi Oren,
>
> Why don’t you want to use a GroupBy? You can cache or checkpoint the
> re
Hi all,
I have 2 Spark jobs: one is the pre-process and the second is the process.
The process job needs to run a calculation for each user in the data.
I want to avoid a shuffle such as groupBy, so I am thinking of saving the result
of the pre-process bucketed by user in Parquet, or of re-partitioning by user
and saving the r
Hi all,
I have Parquet files produced by another job, which saved them bucketed
by userId. How can I read the files as bucketed data in another job? I
tried to read them, but the data was not bucketed (same user in the same partition).
I have 2 Spark jobs: one is the pre-process and the second is the process.
The process job needs to run a calculation for each user in the data.
I want to avoid a shuffle such as groupBy, so I am thinking of saving the result
of the pre-process bucketed by user in Parquet, or of re-partitioning by user
and saving the result.
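
To make the goal concrete, here is a minimal sketch of what bucketing by user
is meant to guarantee. It is a toy model in plain Python, not Spark code: the
names bucket_of, write_bucketed, per_user_totals and NUM_BUCKETS are
hypothetical, and CRC32 only stands in for whatever hash Spark uses internally.
In Spark itself the bucket layout is recorded as table metadata, so as far as
I know it is only kept when the data is written with
df.write.bucketBy(...).saveAsTable(...) and read back with spark.table(...);
a plain spark.read.parquet(path) sees ordinary files.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    """Deterministic bucket id for a user (CRC32 is a stand-in hash)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Pre-process step: place every row in the bucket chosen by its userId."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row["userId"], num_buckets)].append(row)
    return dict(buckets)


def per_user_totals(bucket_rows):
    """Process step: aggregate one bucket on its own, with no data exchange."""
    totals = defaultdict(int)
    for row in bucket_rows:
        totals[row["userId"]] += row["value"]
    return dict(totals)


# Made-up sample rows; "value" is a hypothetical column.
rows = [
    {"userId": "u1", "value": 3},
    {"userId": "u2", "value": 5},
    {"userId": "u1", "value": 4},
    {"userId": "u3", "value": 1},
]

buckets = write_bucketed(rows)

# Each bucket is processed independently; merging is safe because a given
# user can only ever appear in one bucket.
results = {}
for bucket_rows in buckets.values():
    results.update(per_user_totals(bucket_rows))
```

Because every row for a given user lands in the same bucket, the process step
can handle each bucket separately and never needs to move rows between
buckets, which is the shuffle that groupBy would otherwise trigger.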