Hi, this is interesting. Can you please share the code for this and, if possible, the source schema? It would also be great if you could share a sample file.
Regards,
Gourav Sengupta

On Tue, Nov 20, 2018 at 9:50 AM Michael Shtelma <mshte...@gmail.com> wrote:

> You can also cache the data frame on disk if it does not fit into memory.
> An alternative would be to write the data frame out as Parquet and then
> read it back; you can check whether the whole pipeline runs faster this
> way than with the standard cache.
>
> Best,
> Michael
>
> On Tue, Nov 20, 2018 at 9:14 AM Dipl.-Inf. Rico Bergmann <
> i...@ricobergmann.de> wrote:
>
>> Hi!
>>
>> Thanks, Vadim, for your answer. But this would be like caching the
>> dataset, right? Or is checkpointing faster than persisting to memory or
>> disk?
>>
>> I attach a PDF of my dataflow program. If I could compute outputs 1-5
>> in parallel, the output of flatmap1 and groupBy could be reused,
>> avoiding the write to disk (at least until the grouping).
>>
>> Any other ideas or proposals?
>>
>> Best,
>>
>> Rico.
>>
>> On 19.11.2018 at 19:12, Vadim Semenov wrote:
>>
>> You can use checkpointing. In this case Spark writes the RDD out to
>> whatever destination you specify, and the RDD can then be reused from
>> the checkpointed state, avoiding recomputation.
>>
>> On Mon, Nov 19, 2018 at 7:51 AM Dipl.-Inf. Rico Bergmann <
>> i...@ricobergmann.de> wrote:
>>
>>> Thanks for your advice. But I'm using batch processing. Does anyone
>>> have a solution for the batch processing case?
>>>
>>> Best,
>>>
>>> Rico.
>>>
>>> On 19.11.2018 at 09:43, Magnus Nilsson wrote:
>>>
>>> I had the same requirements. As far as I know, the only way is to
>>> extend the ForeachWriter, cache the micro-batch result, and write to
>>> each output:
>>>
>>> https://docs.databricks.com/spark/latest/structured-streaming/foreach.html
>>>
>>> Unfortunately it seems you have to make a new connection per batch
>>> instead of creating one long-lasting connection for the pipeline as a
>>> whole, i.e. you might have to implement some sort of connection
>>> pooling yourself, depending on the sink.
>>>
>>> Regards,
>>>
>>> Magnus
>>>
>>> On Mon, Nov 19, 2018 at 9:13 AM Dipl.-Inf. Rico Bergmann <
>>> i...@ricobergmann.de> wrote:
>>>
>>>> Hi!
>>>>
>>>> I have a SparkSQL program with one input and six outputs (writes).
>>>> When executing this program, every call to write(...) executes the
>>>> plan. My problem is that I want all these writes to happen in
>>>> parallel (inside one execution plan), because all writes have a
>>>> common, compute-intensive subpart that could be shared by all plans.
>>>> Is there a way to do this? (Caching is not a solution because the
>>>> input dataset is way too large...)
>>>>
>>>> Hoping for advice...
>>>>
>>>> Best, Rico B.
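For reference, a minimal Scala sketch of the three materialization options suggested above (disk-only persist, reliable checkpointing, and a manual Parquet round-trip). The input path, checkpoint directory, and the groupBy("key") standing in for the expensive shared computation are all hypothetical placeholders, not taken from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ReuseExpensiveResult {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("reuse-demo").getOrCreate()

    // stand-in for the expensive shared subcomputation
    val expensive = spark.read.parquet("/data/input").groupBy("key").count()

    // Option 1: persist to disk when the result does not fit into memory
    val onDisk = expensive.persist(StorageLevel.DISK_ONLY)

    // Option 2: reliable checkpoint; writes to the checkpoint directory and
    // truncates the lineage, so downstream reuse avoids recomputation
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
    val checkpointed = expensive.checkpoint() // eager by default

    // Option 3: manual Parquet round-trip, then read the result back
    expensive.write.mode("overwrite").parquet("/tmp/expensive-result")
    val reread = spark.read.parquet("/tmp/expensive-result")

    spark.stop()
  }
}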
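And a sketch of one possible workaround for the batch case Rico asks about: materialize the shared subplan once (here with a disk-only persist forced by a count), then submit the six writes from separate threads so Spark can run the jobs concurrently within one application. The paths, the shared computation, and the use of Scala Futures are assumptions for illustration, not a solution confirmed in the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object MultiSinkBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-sink-batch").getOrCreate()

    // expensive shared subplan (the flatmap1 + groupBy part of the pipeline)
    val shared = spark.read.parquet("/data/input")
      .groupBy("key").count()
      .persist(StorageLevel.DISK_ONLY)

    shared.count() // force materialization before the writes start

    // six writes submitted from separate threads; each reuses the cached
    // result instead of re-executing the shared subplan
    val sinks = Seq("/out/1", "/out/2", "/out/3", "/out/4", "/out/5", "/out/6")
    val jobs = sinks.map { path =>
      Future { shared.write.mode("overwrite").parquet(path) }
    }
    jobs.foreach(Await.result(_, Duration.Inf))

    shared.unpersist()
    spark.stop()
  }
}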