Thanks Marko for the detailed answer and the references.

Cheers!

On Mon, Aug 21, 2023 at 1:57 PM Marko Grujic <mark...@gmail.com> wrote:

> Hi Akshara,
>
> > Just for my understanding - the proposal assumes that writes will result
> in
> > a new table version correct?
>
> Actually, the implementation I had in mind does not make any assumptions
> about the
> behaviour of writes, it only accounts for the fact that there may be
> different versions of the
> same table available, and so we need to be able to discern them. But in
> general yeah,
> each write usually results in a new table version.
>
> > schema version itself does not change, rather the records are appendonly
> > and have a timestamp associated with it ( typically in an 'internal'
> > column).
>
> Correct, though schemas can sometimes also evolve (e.g. to facilitate
> ALTER TABLE statements).
> I’m particularly interested in Delta Lake (delta-rs), and in that case the
> protocol tracks the timestamp
> via the logs[1].
>
> > Something like an "AS OF TIMESTAMP" support, basically.
>
> Indeed, this is the most common approach, i.e. having a timestamp to
> delineate
> different table versions. delta-rs also supports explicit version
> referencing[2], while other systems (BigQuery,
> Snowflake) also support offset intervals (e.g. '1 day ago').
>
> I’ve opened a PR to sqlparser that introduces the basic timestamp table
> version referencing
> support for now[3]. This is a prerequisite for DataFusion table time
> travel, though even if people
> don’t agree to pursue that goal it can be independently useful on its own.
>
> Thanks,
> Marko
>
> [1] https://www.splitgraph.com/blog/seafowl-delta-storage-layer
> [2]
> https://docs.rs/deltalake/latest/deltalake/builder/enum.DeltaVersion.html
> [3] https://github.com/sqlparser-rs/sqlparser-rs/pull/951
>
> On 2023/08/19 14:48:10 Akshara Uke wrote:
> > Hi Marko,
> >
> > Indeed most databases do support time travel/stale reads (specially
> > distributed databases) , hence an important feature,IMHO.
> >
> > Just for my understanding - the proposal assumes that writes will result
> in
> > a new table version correct?
> > Asking since, some databases provide stale read support - but the table
> > schema version itself does not change, rather the records are appendonly
> > and have a timestamp associated with it ( typically in an 'internal'
> > column).
> > Perhaps the solution can be extended to have the facility to specify/tag
> ,
> > in the table structure, a column as a commit timestamp tracker , then it
> > can be used to provide stale reads based on a timestamp as well.
> >
> > Something like an "AS OF TIMESTAMP" support, basically.
> >
> > Hope it makes sense.
> >
> > Thanks,
> > akshara
> >
> > On Fri, Aug 18, 2023 at 8:35 AM Marko Grujic <ma...@gmail.com> wrote:
> >
> > > Hi all!
> > >
> > > I'm wondering what people think of a possibility to extend DataFusion
> so as
> > > to accommodate time-travel querying? This would work well with the new
> > > table formats, particularly Iceberg and Delta Lake, where table
> versioning
> > > is at the core of the protocol.
> > >
> > > You can see some details in the issue I raised below[1], but the TLDR
> of
> > > the work I see is:
> > > 1. extend sqlparser-rs to be aware of the `AS OF` clause (or something
> else
> > > people prefer)
> > > 2. capture that information inside `TableFactor::Table
> > > <
> > >
> https://github.com/sqlparser-rs/sqlparser-rs/blob/main/src/ast/query.rs#L650-L664
> > > >`
> > > expression
> > > 3. then in DataFusion itself while building `SessionContextProvider`
> and
> > > pre-populating the tables for a given query keep track of both the
> table
> > > version and table name specified
> > > 4. this would also mean a breaking change in the
> `SchemaProvider::table`
> > > along the lines of
> > > ```rust
> > > async fn table(&self, name: &str, version: Option<TableVersion>) ->
> > > Option<Arc<dyn TableProvider>>
> > > ```
> > > which would allow the provider implementation to be version-aware
> > >
> > > I'd be glad to commence work on this if there's consensus on the
> addition
> > > of such a feature to DataFusion.
> > >
> > > Cheers,
> > > Marko
> > >
> > > [1] https://github.com/apache/arrow-datafusion/issues/7292
> > >
> >

Reply via email to