Re: Parametrisable output metadata path

Wojciech Indyk Mon, 17 Apr 2023 08:19:22 -0700

Hi Jungtaek,
Integration with Delta Lake is not an option for me. I raised a PR that
improves FileStreamSink with the new parameter:
https://github.com/apache/spark/pull/40821. Can you please take a look?


--
Kind regards/ Pozdrawiam,
Wojciech Indyk


On Sun, 16 Apr 2023 at 04:45, Jungtaek Lim <[email protected]>
wrote:

> Hi,
>
> We have been made aware of many issues with the current FileStream
> sink. The effort to fix these issues is quite significant, and it
> resulted in the derivation of "Data Lake" products.
>
> I'd recommend not fixing the issue but leaving it as a known limitation,
> and integrating your workload with Data Lake products. In the interest of
> full disclosure, I work at Databricks so I might be biased, but even when I
> was working at my previous employer, which didn't have a Data Lake product
> at that time, I also had to agree that there are too many things to fix,
> and the effort would be fully redundant with existing products.
>
> It might be helpful to have an "at-least-once" version of the
> FileStream sink, where a metadata directory is no longer needed. It may
> require the implementation to go back to the old way of atomic renaming,
> but it would also remove the need for a metadata directory, so
> someone might find it useful. For end-to-end exactly-once, people can
> either use the current, limited FileStream sink or use Data Lake products.
> I don't see the value in making improvements to the current FileStream sink.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk <[email protected]>
> wrote:
>
>> Hi!
>> I raised a ticket on a parametrisable output metadata path:
>> https://issues.apache.org/jira/browse/SPARK-43152.
>> I am going to raise a PR for it, and I realised that this relatively
>> simple change affects the method hasMetadata(path), which would have a new
>> meaning if we can define a custom path for the metadata of output files.
>> Can you please share your opinion on how a custom output metadata path
>> could affect the design of structured streaming?
>> E.g. I can see one case: I set the output metadata path parameter,
>> run a job on output path A, stop the job, change the output path to B, and
>> hasMetadata works well. If you have any corner case in mind where the
>> parametrised output metadata path can break something, please describe it.
>>
>> --
>> Kind regards/ Pozdrawiam,
>> Wojciech Indyk
>>
>
