Re: Coalesce behaviour

2018-11-19 Thread Sergey Zhemzhitsky
o not have any parent dependencies and always return an empty iterator. I believe this should work as desired (at least the previous ShuffleMapStage will think tha

Re: Coalesce behaviour

2018-10-15 Thread Koert Kuipers
nged). There are a few issues though: the existence of empty partitions, which can be evaluated almost for free, and empty output files from these empty partitions, which can be beaten by m

Re: Coalesce behaviour

2018-10-14 Thread Jörn Franke
hat the number of partitions in the next stage, it generates shuffle output for, is not changed). There are a few issues though: the existence of empty partitions, which can be

Re: Coalesce behaviour

2018-10-14 Thread Koert Kuipers
be beaten by means of LazyOutputFormat in the case of RDDs. On Mon, Oct 8, 2018, 23:57 Koert Kuipers wrote: although i person

Re: Coalesce behaviour

2018-10-14 Thread Wenchen Fan
: although i personally would describe this as a bug the answer will be that this is the intended behavior. the coalesce "infects" the shuffle before it, making a coalesce useless for reducing o

Re: Coalesce behaviour

2018-10-13 Thread Koert Kuipers
put files after a shuffle with many partitions by design. your only option left is a repartition for which you pay the price in that it introduces another expensive shuffle.

Re: Coalesce behaviour

2018-10-13 Thread Sergey Zhemzhitsky
that this is the intended behavior. the coalesce "infects" the shuffle before it, making a coalesce useless for reducing output files after a shuffle with many partitions by design. your o

Re: Coalesce behaviour

2018-10-12 Thread Wenchen Fan
g output files after a shuffle with many partitions by design. your only option left is a repartition for which you pay the price in that it introduces another expensive shuffle. interesting

Re: Coalesce behaviour

2018-10-12 Thread Koert Kuipers
hich you pay the price in that it introduces another expensive shuffle. interestingly if you do a coalesce on a map-only job it knows how to reduce the partitions and output files without introducing a shuffle, so clearly it is possi

Re: Coalesce behaviour

2018-10-12 Thread Sergey Zhemzhitsky
t, making a coalesce useless for reducing output files after a shuffle with many partitions by design. your only option left is a repartition for which you pay the price in that it introduces another expensive shuffle.

Re: Coalesce behaviour

2018-10-12 Thread Sergey Zhemzhitsky
eft is a repartition for which you pay the price in that it introduces another expensive shuffle. interestingly if you do a coalesce on a map-only job it knows how to reduce the partitions and output files without introducing a shuffle, so clear

Re: Coalesce behaviour

2018-10-10 Thread Wenchen Fan
utput files without introducing a shuffle, so clearly it is possible, but i don't know how to get this behavior after a shuffle in an existing job. On Fri, Oct 5, 2018 at 6:34 PM Sergey Zhemzhitsky wrote: Hello guys,

Re: Coalesce behaviour

2018-10-09 Thread Sergey Zhemzhitsky
o get this behavior after a shuffle in an existing job. On Fri, Oct 5, 2018 at 6:34 PM Sergey Zhemzhitsky wrote: Hello guys, Currently I'm a little bit confused by coalesce behaviour. Consider the following use case: I'

Re: Coalesce behaviour

2018-10-08 Thread Koert Kuipers
ffle in an existing job. On Fri, Oct 5, 2018 at 6:34 PM Sergey Zhemzhitsky wrote: Hello guys, Currently I'm a little bit confused by coalesce behaviour. Consider the following use case: I'd like to join two pretty big RDDs. To make the join more stable and t

Coalesce behaviour

2018-10-05 Thread Sergey Zhemzhitsky
Hello guys, Currently I'm a little bit confused by coalesce behaviour. Consider the following use case: I'd like to join two pretty big RDDs. To make the join more stable and to prevent it from failing with OOM, RDDs are usually repartitioned to redistribute data more evenly and to pre