Hi Jungtaek,
integration with Delta Lake is not an option for me. I raised a PR that improves FileStreamSink with the new parameter: https://github.com/apache/spark/pull/40821. Could you please take a look?
--
Kind regards/ Pozdrawiam,
Wojciech Indyk

On Sun, 16 Apr 2023 at 04:45, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Hi,
>
> We have been made aware of many issues with the current FileStream
> sink. The effort to fix these issues is quite significant, and it ended up
> with the derivation of "Data Lake" products.
>
> I'd recommend not fixing the issue but leaving it as a limitation, and
> integrating your workload with Data Lake products. For full disclosure, I
> work at Databricks so I might be biased, but even when I was working at my
> previous employer, which didn't have a Data Lake product at that time, I
> also had to agree that there are too many things to fix, and the effort
> would be fully redundant with existing products.
>
> Maybe it would be helpful to have an "at-least-once" version of the
> FileStream sink, where a metadata directory is no longer needed. It may
> require the implementation to go back to the old way of atomic renaming,
> but it would also get rid of the need for a metadata directory, so
> someone might find it useful. For end-to-end exactly-once, people can
> either use the limited current FileStream sink or use Data Lake products. I
> don't see the value in making improvements to the current FileStream sink.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk <wojciechin...@gmail.com>
> wrote:
>
>> Hi!
>> I raised a ticket on a parametrisable output metadata path:
>> https://issues.apache.org/jira/browse/SPARK-43152.
>> I am going to raise a PR for it, and I realised that this relatively
>> simple change impacts the method hasMetadata(path), which would have a new
>> meaning if we can define a custom path for the metadata of output files.
>> Can you please share your opinion on how the custom output metadata path
>> can affect the design of structured streaming?
>> E.g. I can see one case: when I set the output metadata path parameter,
>> run a job on output path A, stop the job, and change the output path to B,
>> hasMetadata works well. If you have any corner case in mind where the
>> parametrised output metadata path can break something, please describe it.
>>
>> --
>> Kind regards/ Pozdrawiam,
>> Wojciech Indyk
>>