Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-09-27 Thread Jungtaek Lim
Bumping to see if anyone is interested in or concerned about this.

On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim 
wrote:

> Bump this again.
>
> On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Bump again.
>>
>> Unlike the file stream sink, which has lots of limitations and for which
>> many of us have been suggesting alternatives, the file stream source is the
>> only way to go if end users want to read data from files as a stream. There
>> is no alternative unless they introduce another ETL & storage layer
>> (probably Kafka).
>>
>> On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi German,
>>>
>>> Option 1 isn't about "deleting" the old files, as your input directory
>>> may be accessed by multiple queries. Kafka centralizes the maintenance of
>>> the input data, hence it can apply retention without problems.
>>> Option 1 is more about "hiding" the old files from being read, so that end
>>> users "may" be able to delete the files once they ensure that "all queries
>>> accessing the input directory" no longer see them.
>>>
>>> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon <
>>> gschiavonsp...@gmail.com> wrote:
>>>
 Hi Jungtaek,

 I have a question, aren't both approaches compatible?

 The way I see it, it would be interesting to have a retention period to
 delete old files and/or the possibility of indicating an offset (timestamp).
 It would be very "similar" to how we do it with Kafka.

 WDYT?
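
For illustration, a minimal sketch of the two ideas above, assuming Spark
3.0+. The Kafka source already supports starting from a point in time via
startingOffsetsByTimestamp; the file-source option name used here
(startingFileTimestamp) is purely hypothetical and does not exist today.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .master("local[*]").appName("start-from-timestamp").getOrCreate()

    // Kafka: start each partition of "events" from a given epoch-millis timestamp.
    val fromKafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsetsByTimestamp", """{"events": {"0": 1596153600000}}""")
      .load()

    // File source: a hypothetical analogue - "startingFileTimestamp" is NOT a
    // real option, it only illustrates what the "offset (timestamp)" idea could look like.
    val schema = new StructType().add("id", LongType).add("ts", TimestampType)
    val fromFiles = spark.readStream
      .schema(schema)
      .option("startingFileTimestamp", "1596153600000")   // hypothetical
      .json("/data/events")
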

 On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> (I'd like to keep this discussion thread focused on the specific topic -
> let's initiate separate discussion threads for different topics.)
>
> Thanks for the input. I'd like to emphasize that the point under discussion
> is the "latestFirst" option - the rationale starts from the growing
> metadata log issue. I read your input as picking option 2, but could you
> please confirm that you are OK with "replacing" the "latestFirst" option
> with "starting from timestamp"?
>
>
> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal <
> vikram.agra...@gmail.com> wrote:
>
>> If we compare the file stream source with other streaming sources such as
>> Kafka, the current behavior is indeed incomplete. Starting the stream from
>> a custom offset/particular point in time is something that is missing.
>> Typically file stream sources don't have auto-deletion of older data/files.
>> In Kafka we can define a retention period, so even if we use "earliest" we
>> won't end up reading from the time when the Kafka topic was created. On the
>> other hand, file stream sources can hold very old files. It's a very valid
>> use case to read the bulk of the old files using a batch job up to a
>> particular timestamp, and then use a streaming job for real-time updates.
>>
>> So having support where we can specify a timestamp, and only consider files
>> created after that timestamp, would be useful.
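
For illustration, a minimal sketch of the bulk-backfill-then-stream pattern
mentioned above (paths and schema are made up; today the cut-over point has
to be managed manually, e.g. by partition/path filtering in the batch job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .master("local[*]").appName("backfill-then-stream").getOrCreate()
    val schema = new StructType().add("id", LongType).add("ts", TimestampType)

    // 1) Bulk-read the historical files with a batch job, e.g. by date partition.
    val history = spark.read.schema(schema).json("/data/events/date=2020-07-*")

    // 2) Run a streaming job for real-time updates; a "start from timestamp"
    //    option on the file source would make this hand-off clean.
    val updates = spark.readStream.schema(schema).json("/data/events")
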
>>
>> Another concern we need to consider is the listing cost. Is there any way
>> we can avoid listing the entire base directory and then filtering out the
>> new files? If the data is organized into partitions by date, would it help
>> to list only those partitions where new files were added?
>>
>>
>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Bump - is there any interest in this topic?
>>>
>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 (Just to add some rationale: you can refer to the original mail thread on
 the dev@ list to see the efforts on addressing problems in the file stream
 source / sink -
 https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
 )

 On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Hi devs,
>
> As I have been going through the various issues around metadata log
> growth, it's not only an issue for the sink but also for the source.
> Unlike the sink metadata log, whose entries should be available to the
> readers, the source metadata log is only for the streaming query restarting
> from the checkpoint; hence, in theory, it should only need to remember the
> minimal set of entries that prevents processing the same file multiple
> times.
>
> This does not hold for the file stream source, and I think that's because
> of the "latestFirst" option, which I haven't seen in any other source. The
> option works by reading files in "backward" order,
> 
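
For context, a minimal sketch of how the existing option is used today (path
and schema are illustrative); "latestFirst" and "maxFilesPerTrigger" are real
options of the file stream source:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .master("local[*]").appName("latest-first").getOrCreate()
    val schema = new StructType().add("id", LongType).add("ts", TimestampType)

    // Process the newest files first; combined with maxFilesPerTrigger this
    // reads the directory "backwards", which is why the source metadata log
    // cannot simply keep only the most recent entries.
    val stream = spark.readStream
      .schema(schema)
      .option("latestFirst", "true")
      .option("maxFilesPerTrigger", "100")
      .json("/data/input")
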

Re: Output mode in Structured Streaming and DSv1 sink/DSv2 table

2020-09-27 Thread Jungtaek Lim
Bumping to see if anyone is interested in or concerned about this.

On Sun, Sep 20, 2020 at 1:59 PM Jungtaek Lim 
wrote:

> Hi devs,
>
> We have a capability check in DSv2 defining which operations can be done
> against a data source, for both read and write. The concept was introduced
> in DSv2, so it's not strange that DSv1 doesn't have it.
>
> In SS the problem arises - if I understand correctly, we would like to
> couple the output mode of the query with the behavior of the output table.
> That is, complete mode should require the output table to truncate its
> content, and update mode should require the output table to "upsert" or
> "delete and append" the content.
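
For illustration, a minimal sketch of the coupling described above (source
and sink choices are arbitrary); in complete mode the output table is
expected to truncate its content on every trigger, which is the point being
argued here:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]").appName("output-modes").getOrCreate()

    // A streaming aggregation; its full result is rewritten every trigger.
    val counts = spark.readStream.format("rate").load()
      .groupBy("value").count()

    counts.writeStream
      .outputMode("complete")   // "update" would instead mean upsert / delete-and-append
      .format("console")
      .start()
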
>
> Nothing has been done for the DSv1 sink - Spark doesn't enforce anything
> and effectively works in append mode, though the query still respects the
> output mode in stateful operations.
>
> I understand we don't want to surprise end users with broken
> compatibility, but shouldn't that be a "temporary", "exceptional" case
> that DSv2 never repeats? I'm seeing many built-in data sources being
> migrated to DSv2 with the exception of "do nothing for update/truncate",
> which simply undermines the rationale for capabilities.
>
> In addition, they don't add TRUNCATE to their capabilities but do add
> SupportsTruncate to the WriteBuilder, which is odd. It works as of now
> because SS doesn't check capabilities on the writer side (I guess it only
> checks STREAMING_WRITE), but once we check capabilities in the first place,
> things will break.
> (I'm looking into adding a writer plan in SS before the analyzer, and
> checking capabilities there.)
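
For illustration, a minimal sketch (class names are made up) of the pairing
being argued for: a sink table that declares TRUNCATE alongside
STREAMING_WRITE, and whose WriteBuilder implements SupportsTruncate, which
Spark invokes for queries running in complete output mode:

    import java.util
    import org.apache.spark.sql.connector.catalog.{SupportsWrite, TableCapability}
    import org.apache.spark.sql.connector.write.{LogicalWriteInfo, SupportsTruncate, WriteBuilder}
    import org.apache.spark.sql.connector.write.streaming.StreamingWrite
    import org.apache.spark.sql.types.StructType

    // Hypothetical sink table: declares the TRUNCATE capability up front...
    class MySinkTable extends SupportsWrite {
      override def name(): String = "my_sink"
      override def schema(): StructType = new StructType()

      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.STREAMING_WRITE, TableCapability.TRUNCATE)

      override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
        new MyWriteBuilder
    }

    // ...and matches it by implementing SupportsTruncate in the WriteBuilder.
    class MyWriteBuilder extends WriteBuilder with SupportsTruncate {
      private var truncateFirst = false

      // Called by Spark when the query should overwrite the existing content.
      override def truncate(): WriteBuilder = { truncateFirst = true; this }

      override def buildForStreaming(): StreamingWrite = {
        // build a StreamingWrite that clears existing data when truncateFirst is set
        ???
      }
    }
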
>
> What would be our best fix for this issue? Should we leave the
> responsibility for handling "truncate" to the data source (so doing nothing
> is fine if it's intended), and just add TRUNCATE to its capabilities? (That
> should be documented in the data source's description, though.) Or should
> we drop support for truncate when the data source is unable to truncate?
> (Foreach and Kafka output tables would then be unable to run in complete
> mode.)
>
> Looking forward to hearing everyone's thoughts.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>