bump to see anyone interested or concerned about this. On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
> Bump this again. > > On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Bump again. >> >> Unlike file stream sink which has lots of limitations and many of us have >> been suggesting alternatives, file stream source is the only way if end >> users want to read the data from files. No alternative unless they >> introduce another ETL & storage (probably Kafka). >> >> On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim < >> kabhwan.opensou...@gmail.com> wrote: >> >>> Hi German, >>> >>> option 1 isn't about "deleting" the old files, as your input directory >>> may be accessed by multiple queries. Kafka centralizes the maintenance of >>> input data hence possible to apply retention without problem. >>> option 1 is more about "hiding" the old files being read, so that end >>> users "may" be able to delete the files once they ensure "all queries >>> accessing the input directory" don't see the old files. >>> >>> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon < >>> gschiavonsp...@gmail.com> wrote: >>> >>>> HI Jungtaek, >>>> >>>> I have a question, aren't both approaches compatible? >>>> >>>> How I see it, I think It would be interesting to have a retention >>>> period to delete old files and/or the possibility of indicating an offset >>>> (Timestamp). It would be very "similar" to how we do it with kafka. >>>> >>>> WDYT? >>>> >>>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim < >>>> kabhwan.opensou...@gmail.com> wrote: >>>> >>>>> (I'd like to keep the discussion thread focusing on the specific topic >>>>> - let's initiate another discussion threads on different topics.) >>>>> >>>>> Thanks for the input. I'd like to emphasize that the point in >>>>> discussion is the "latestFirst" option - the rationalization starts from >>>>> growing metadata log issues. I hope your input is picking option 2, but >>>>> could you please make clear your input represents OK to "replace" the >>>>> "latestFirst" option with "starting from timestamp"? >>>>> >>>>> >>>>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal < >>>>> vikram.agra...@gmail.com> wrote: >>>>> >>>>>> If we compare file-stream source with other streaming sources such as >>>>>> Kafka, the current behavior is indeed incomplete. Starting the streaming >>>>>> from a custom offset/particular point of time is something that is >>>>>> missing. >>>>>> Typically filestream sources don't have auto-deletion of the older >>>>>> data/files. In kafka we can define the retention period. So even if we >>>>>> use >>>>>> "Earliest" we won't end up reading from the time when the Kafka topic was >>>>>> created. On the other hand, streaming sources can hold very old files. >>>>>> It's >>>>>> very valid use-cases to read the bulk of the old files using a batch job >>>>>> until a particular timestamp. And then use streaming jobs for real-time >>>>>> updates. >>>>>> >>>>>> So having support where we can specify a timestamp. and we would >>>>>> consider files created post that timestamp can be useful. >>>>>> >>>>>> Another concern which we need to consider is the listing cost. is >>>>>> there any way we can avoid listing the entire base directory and then >>>>>> filtering out the new files. if the data is organized as partitions using >>>>>> date, will it help to list only those partitions where new files were >>>>>> added? >>>>>> >>>>>> >>>>>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim < >>>>>> kabhwan.opensou...@gmail.com> wrote: >>>>>> >>>>>>> bump, is there any interest on this topic? >>>>>>> >>>>>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim < >>>>>>> kabhwan.opensou...@gmail.com> wrote: >>>>>>> >>>>>>>> (Just to add rationalization, you can refer the original mail >>>>>>>> thread on dev@ list to see efforts on addressing problems in file >>>>>>>> stream source / sink - >>>>>>>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E >>>>>>>> ) >>>>>>>> >>>>>>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim < >>>>>>>> kabhwan.opensou...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi devs, >>>>>>>>> >>>>>>>>> As I have been going through the various issues on metadata log >>>>>>>>> growing, it's not only the issue of sink, but also the issue of >>>>>>>>> source. >>>>>>>>> Unlike sink metadata log which entries should be available to the >>>>>>>>> readers, the source metadata log is only for the streaming query >>>>>>>>> starting >>>>>>>>> from the checkpoint, hence in theory it should only memorize about >>>>>>>>> minimal entries which prevent processing multiple times on the same >>>>>>>>> file. >>>>>>>>> >>>>>>>>> This is not applied to the file stream source, and I think it's >>>>>>>>> because of the existence of the "latestFirst" option which I haven't >>>>>>>>> seen >>>>>>>>> from any sources. The option works as reading files in "backward" >>>>>>>>> order, >>>>>>>>> which means Spark can read the oldest file and latest file together >>>>>>>>> in a >>>>>>>>> micro-batch, which ends up having to memorize all files previously >>>>>>>>> read. >>>>>>>>> The option can be changed during query restart, so even if the query >>>>>>>>> is >>>>>>>>> started with "latestFirst" being false, it's not safe to apply the >>>>>>>>> logic of >>>>>>>>> minimizing entries to memorize, as the option can be changed to true >>>>>>>>> and >>>>>>>>> then we'll read files again. >>>>>>>>> >>>>>>>>> I'm seeing two approaches here: >>>>>>>>> >>>>>>>>> 1) apply "retention" - unlike "maxFileAge", the option would apply >>>>>>>>> to latestFirst as well. That said, if the retention is set to 7 days, >>>>>>>>> the >>>>>>>>> files older than 7 days would never be read in any way. With this >>>>>>>>> approach >>>>>>>>> we can at least get rid of entries which are older than retention. The >>>>>>>>> issue is how to play nicely with existing "maxFileAge", as it also >>>>>>>>> plays >>>>>>>>> similar with the retention, though it's being ignored when >>>>>>>>> latestFirst is >>>>>>>>> turned on. (Change the semantic of "maxFileAge" vs leave it to "soft >>>>>>>>> retention" and introduce another option.) >>>>>>>>> >>>>>>>>> (This approach is being proposed under SPARK-17604, and PR is >>>>>>>>> available - https://github.com/apache/spark/pull/28422) >>>>>>>>> >>>>>>>>> 2) replace "latestFirst" option with alternatives, which no longer >>>>>>>>> read in "backward" order - this doesn't say we have to read all files >>>>>>>>> to >>>>>>>>> move forward. As we do with Kafka, start offset can be provided, >>>>>>>>> ideally as >>>>>>>>> a timestamp, which Spark will read from such timestamp and forward >>>>>>>>> order. >>>>>>>>> This doesn't cover all use cases of "latestFirst", but "latestFirst" >>>>>>>>> doesn't seem to be natural with the concept of SS (think about >>>>>>>>> watermark), >>>>>>>>> I'd prefer to support alternatives instead of struggling with >>>>>>>>> "latestFirst". >>>>>>>>> >>>>>>>>> Would like to hear your opinions. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Jungtaek Lim (HeartSaVioR) >>>>>>>>> >>>>>>>>