Hi, this is interesting. Can you please share the code for this and, if possible, the source schema? It would also be great if you could share a sample file.
Regards,
Gourav Sengupta

On Tue, Nov 20, 2018 at 9:50 AM Michael Shtelma <mshte...@gmail.com> wrote:

> You can also cache the data frame on disk if it does not fit into memory.
> An alternative would be to write the data frame out as Parquet and then
> read it back; you can check whether the whole pipeline runs faster this
> way than with the standard cache.
>
> Best,
> Michael
>
> On Tue, Nov 20, 2018 at 9:14 AM Dipl.-Inf. Rico Bergmann <
> i...@ricobergmann.de> wrote:
>
>> Hi!
>>
>> Thanks, Vadim, for your answer. But this would be like caching the
>> dataset, right? Or is checkpointing faster than persisting to memory or
>> disk?
>>
>> I attach a PDF of my dataflow program. If I could compute outputs 1-5
>> in parallel, the output of flatmap1 and groupBy could be reused,
>> avoiding the write to disk (at least until the grouping).
>>
>> Any other ideas or proposals?
>>
>> Best,
>>
>> Rico.
>>
>> On 19.11.2018 at 19:12, Vadim Semenov wrote:
>>
>> You can use checkpointing. In this case Spark writes the RDD out to
>> whatever destination you specify, and the RDD can then be reused from
>> the checkpointed state, avoiding recomputation.
>>
>> On Mon, Nov 19, 2018 at 7:51 AM Dipl.-Inf. Rico Bergmann <
>> i...@ricobergmann.de> wrote:
>>
>>> Thanks for your advice. But I'm using batch processing. Does anyone
>>> have a solution for the batch processing case?
>>>
>>> Best,
>>>
>>> Rico.
>>>
>>> On 19.11.2018 at 09:43, Magnus Nilsson wrote:
>>>
>>> I had the same requirements. As far as I know, the only way is to
>>> extend the ForeachWriter, cache the micro-batch result, and write to
>>> each output:
>>>
>>> https://docs.databricks.com/spark/latest/structured-streaming/foreach.html
>>>
>>> Unfortunately it seems you have to make a new connection per batch
>>> instead of creating one long-lasting connection for the pipeline as a
>>> whole, i.e. you might have to implement some sort of connection
>>> pooling yourself, depending on the sink.
>>>
>>> Regards,
>>>
>>> Magnus
>>>
>>> On Mon, Nov 19, 2018 at 9:13 AM Dipl.-Inf. Rico Bergmann <
>>> i...@ricobergmann.de> wrote:
>>>
>>>> Hi!
>>>>
>>>> I have a SparkSQL program with one input and six outputs (writes).
>>>> When executing this program, every call to write(...) executes the
>>>> plan. My problem is that I want all these writes to happen in
>>>> parallel (inside one execution plan), because all writes have a
>>>> common, compute-intensive subpart that could be shared by all plans.
>>>> Is there a way to do this? (Caching is not a solution because the
>>>> input dataset is way too large...)
>>>>
>>>> Hoping for advice...
>>>>
>>>> Best, Rico B.
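For reference, a minimal Scala sketch of the three materialization options suggested above (disk-only persist, reliable checkpointing, and a manual Parquet round-trip). The input path, checkpoint directory, and the groupBy("key") standing in for the expensive shared computation are all hypothetical placeholders, not taken from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ReuseExpensiveResult {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reuse-demo").getOrCreate()

    // stand-in for the expensive shared subcomputation
    val expensive = spark.read.parquet("/data/input").groupBy("key").count()

    // Option 1: persist to disk when the result does not fit into memory
    val onDisk = expensive.persist(StorageLevel.DISK_ONLY)

    // Option 2: reliable checkpoint; writes to the checkpoint directory and
    // truncates the lineage, so downstream reuse avoids recomputation
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
    val checkpointed = expensive.checkpoint() // eager by default

    // Option 3: manual Parquet round-trip, then read the result back
    expensive.write.mode("overwrite").parquet("/tmp/expensive-result")
    val reread = spark.read.parquet("/tmp/expensive-result")

    spark.stop()
  }
}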
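And a sketch of one possible workaround for the batch case Rico asks about: materialize the shared subplan once (here with a disk-only persist forced by a count), then submit the six writes from separate threads so Spark can run the jobs concurrently within one application. The paths, the shared computation, and the use of Scala Futures are assumptions for illustration, not a solution confirmed in the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object MultiSinkBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-sink-batch").getOrCreate()

    // expensive shared subplan (the flatmap1 + groupBy part of the pipeline)
    val shared = spark.read.parquet("/data/input")
      .groupBy("key").count()
      .persist(StorageLevel.DISK_ONLY)

    shared.count() // force materialization before the writes start

    // six writes submitted from separate threads; each reuses the cached
    // result instead of re-executing the shared subplan
    val sinks = Seq("/out/1", "/out/2", "/out/3", "/out/4", "/out/5", "/out/6")
    val jobs = sinks.map { path =>
      Future { shared.write.mode("overwrite").parquet(path) }
    }
    jobs.foreach(Await.result(_, Duration.Inf))

    shared.unpersist()
    spark.stop()
  }
}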