Bump to see if anyone is interested in or concerned about this.
On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim
wrote:
> Bump this again.
>
> On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Bump again.
>>
>> Unlike the file stream sink, which has lots of limitations and for which
>> many of us have been suggesting alternatives, the file stream source is
>> the only way for end users to read data from files. There is no
>> alternative unless they introduce another ETL pipeline & storage
>> (probably Kafka).
>>
>> On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi German,
>>>
>>> Option 1 isn't about "deleting" the old files, as your input directory
>>> may be accessed by multiple queries. Kafka centralizes the maintenance of
>>> input data, hence it can apply retention without problems.
>>> Option 1 is more about "hiding" the old files from being read, so that
>>> end users "may" be able to delete the files once they ensure "all queries
>>> accessing the input directory" no longer see the old files.
>>>
>>> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon <
>>> gschiavonsp...@gmail.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I have a question: aren't both approaches compatible?
>>>> The way I see it, it would be interesting to have a retention period
>>>> to delete old files and/or the possibility of indicating an offset
>>>> (timestamp). It would be very "similar" to how we do it with Kafka.
>>>> WDYT?
>>>>
>>>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
> (I'd like to keep this discussion thread focused on the specific topic
> - let's start separate discussion threads for different topics.)
>
> Thanks for the input. I'd like to emphasize that the point under
> discussion is the "latestFirst" option - the rationale starts from the
> growing metadata log issues. I guess your input favors option 2, but
> could you please make clear whether you're OK with "replacing" the
> "latestFirst" option with "starting from timestamp"?
>
>
> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal <
> vikram.agra...@gmail.com> wrote:
>
>> If we compare the file stream source with other streaming sources such as
>> Kafka, the current behavior is indeed incomplete. Starting the streaming
>> from a custom offset/particular point in time is something that is
>> missing.
>> Typically file stream sources don't have auto-deletion of older
>> data/files. In Kafka we can define the retention period, so even if we
>> use "earliest" we won't end up reading from the time when the Kafka topic
>> was created. On the other hand, file stream sources can hold very old
>> files. It's a very valid use case to read the bulk of the old files using
>> a batch job until a particular timestamp, and then use streaming jobs for
>> real-time updates.
>>
>> So having support for specifying a timestamp, such that only files
>> created after that timestamp are considered, can be useful.
>>
>> Another concern which we need to consider is the listing cost. Is
>> there any way we can avoid listing the entire base directory and then
>> filtering out the new files? If the data is organized as partitions using
>> date, will it help to list only those partitions where new files were
>> added?
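[Editorial note: on the listing-cost point, if the input is laid out as date partitions, the source could prune the listing to partitions at or after the starting date instead of scanning the whole base directory. A rough sketch under the assumption of a `date=YYYY-MM-DD` directory layout; this is hypothetical, not Spark internals.]

```python
import os

def list_candidate_partitions(base_dir, start_date):
    """List only date partitions >= start_date (format 'YYYY-MM-DD').

    Assumes directories named 'date=YYYY-MM-DD'; plain string comparison
    works because that date format sorts lexicographically.
    """
    parts = []
    for entry in os.scandir(base_dir):
        if entry.is_dir() and entry.name.startswith("date="):
            if entry.name.split("=", 1)[1] >= start_date:
                parts.append(entry.path)
    return sorted(parts)
```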
>>
>>
>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Bump, is there any interest in this topic?
>>>
>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> (Just to add rationale, you can refer to the original mail
>>>> thread on the dev@ list to see efforts on addressing problems in the
>>>> file stream source / sink -
>>>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
>>>> )
>>>>
>>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
> Hi devs,
>
> As I have been going through the various issues around metadata log
> growth, it's not only an issue of the sink, but also of the source.
> Unlike the sink metadata log, whose entries should be available to
> readers, the source metadata log is only for the streaming query
> starting from the checkpoint; hence, in theory, it should only memorize
> the minimal set of entries needed to prevent processing the same file
> multiple times.
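[Editorial note: to illustrate "minimal entries": the source log only needs enough recent entries to deduplicate files it could plausibly see again in a listing; anything older than a retention window can be purged and treated as already processed. A toy sketch of that idea in plain Python; this is not the actual Spark code, though Spark's file source exposes a related knob via its "maxFileAge" option.]

```python
class SeenFilesLog:
    """Toy dedup log: remembers path -> timestamp and purges entries
    older than max_age relative to the newest timestamp seen.
    Not the actual FileStreamSource implementation.
    """

    def __init__(self, max_age):
        self.max_age = max_age
        self.latest = 0
        self.seen = {}

    def is_new(self, path, ts):
        # Files older than the retention cutoff are assumed already
        # processed, so the log never needs to remember them.
        return ts >= self.latest - self.max_age and path not in self.seen

    def add(self, path, ts):
        self.seen[path] = ts
        if ts > self.latest:
            self.latest = ts
            cutoff = self.latest - self.max_age
            # Purge entries that fell out of the retention window.
            self.seen = {p: t for p, t in self.seen.items() if t >= cutoff}
```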
>
> This is not the case for the file stream source, and I think it's
> because of the existence of the "latestFirst" option, which I haven't
> seen in any other sources. The option works by reading files in
> "backward" order,
>