This concept is explained here:
<https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html>
If that is the case, why can't we just use

SELECT * FROM <TABLE_NAME> VERSION AS OF <VERSION_NO>
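For reference, the two standard time-travel forms under discussion can be rendered as plain query strings. The helper below is a toy illustration only (not a Spark API); the table name and values are placeholders:

```python
# Toy helper (illustrative, not Spark API): render a time-travel query
# using the VERSION AS OF / TIMESTAMP AS OF syntax being added to
# Spark SQL. Exactly one of version/timestamp must be supplied.
def time_travel_query(table, version=None, timestamp=None):
    if (version is None) == (timestamp is None):
        raise ValueError("give exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {version}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"

print(time_travel_query("t", version=42))
# SELECT * FROM t VERSION AS OF 42
```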


  view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 16 Nov 2021 at 17:49, Ryan Blue <b...@tabular.io> wrote:

> Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS
> OF syntax.
>
> On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> As I stated before, hints are designed to direct the optimizer to choose
>> a certain query execution plan based on specific criteria.
>>
>>
>> -- time travel
>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>
>>
>> The alternative would be to specify time travel by creating a snapshot
>> based on a CURRENT_DATE() range, which encapsulates time travel for
>> 'snapshot-id'='10963874102873L':
>>
>>
>> CREATE SNAPSHOT t_snap
>>   START WITH CURRENT_DATE() - 30
>>   NEXT CURRENT_DATE()
>>   AS SELECT * FROM t
>>
>> SELECT * FROM t_snap
>>
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> My biggest concern with the syntax in hints is that Spark SQL's options
>>> can change results (e.g., CSV's header option), whereas hints are
>>> generally not designed to affect external results, if I am not mistaken.
>>> This is counterintuitive.
>>> I left a comment in the PR, but what's the real benefit over leveraging
>>> SET conf and RESET conf? We can extract options from runtime session
>>> configurations, e.g., via SessionConfigSupport.
>>>
>>> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Side note about time travel: There is a PR
>>>> <https://github.com/apache/spark/pull/34497> to add VERSION/TIMESTAMP
>>>> AS OF syntax to Spark SQL.
>>>>
>>>> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I want to note that I wouldn't recommend time traveling this way by
>>>>> using the hint for `snapshot-id`. Instead, we want to add the standard SQL
>>>>> syntax for that in a separate PR. This is useful for other options that
>>>>> help a table scan perform better, like specifying the target split size.
>>>>>
>>>>> You're right that this isn't a typical optimizer hint, but I'm not
>>>>> sure what other syntax is possible for this use case. How else would we
>>>>> send custom properties through to the scan?
>>>>>
>>>>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I am looking at the hint and it appears to me (I stand to be
>>>>>> corrected) that it is a single-table hint, as below:
>>>>>>
>>>>>> -- time travel
>>>>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>>>>
>>>>>> My assumption is that any view on this table will also benefit from
>>>>>> this hint. This is not a hint to the optimizer in the classical sense,
>>>>>> only a snapshot hint. Normally, a hint is an instruction to the
>>>>>> optimizer. When writing SQL, one may know information about the data
>>>>>> that is unknown to the optimizer. Hints enable one to make decisions
>>>>>> normally made by the optimizer, sometimes causing the optimizer to
>>>>>> select a plan that it sees as higher cost.
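To make the shape of that hint concrete, here is a toy extraction of the key/value pairs from an OPTIONS hint comment. The regex and function are illustrative only; Spark's real parser handles this in its SQL grammar, not with a regex:

```python
import re

# Toy extraction of key/value pairs from an OPTIONS hint comment.
# Illustrates the syntax's shape only; not how Spark parses it.
def parse_options_hint(sql):
    m = re.search(r"/\*\+\s*OPTIONS\((.*?)\)\s*\*/", sql)
    if not m:
        return {}
    pairs = re.findall(r"'([^']*)'\s*=\s*'([^']*)'", m.group(1))
    return dict(pairs)

sql = "SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */"
print(parse_options_hint(sql))  # {'snapshot-id': '10963874102873L'}
```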
>>>>>>
>>>>>>
>>>>>> So far as this case is concerned, it looks OK and I concur it should
>>>>>> be extended to write as well.
>>>>>>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I think since we probably will end up using this same syntax on
>>>>>>> write, this makes a lot of sense. Unless there is another good way to
>>>>>>> express a similar concept during a write operation I think going forward
>>>>>>> with this would be ok.
>>>>>>>
>>>>>>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> The proposed feature is to be able to pass options through SQL like
>>>>>>>> you would when using the DataFrameReader API, so it would work for
>>>>>>>> all sources that support read options. Read options are part of the
>>>>>>>> DSv2 API; there just isn’t a way to pass options when using SQL. The
>>>>>>>> PR also has a non-Iceberg example, which is being able to customize
>>>>>>>> some JDBC source behaviors per query (e.g., fetchSize), rather than
>>>>>>>> globally in the table’s options.
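For readers unfamiliar with the DataFrameReader path being compared, the accumulation pattern it uses (chained `.option(key, value)` calls collected into the source's read options) can be mimicked with a self-contained toy stand-in; the `Reader` class and the option values below are assumptions for illustration, not Spark's actual class:

```python
# Toy stand-in for the DataFrameReader option-accumulation pattern:
# each .option(k, v) call records a key/value pair that the source
# eventually receives as its read options. Not Spark's actual API.
class Reader:
    def __init__(self, fmt):
        self.fmt = fmt
        self.options = {}

    def option(self, key, value):
        self.options[key] = value
        return self  # chainable, like DataFrameReader

r = (Reader("jdbc")
     .option("fetchSize", "1000")
     .option("url", "jdbc:postgresql://localhost/db"))
print(r.options)
# {'fetchSize': '1000', 'url': 'jdbc:postgresql://localhost/db'}
```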
>>>>>>>>
>>>>>>>> The proposed syntax is odd, but I think that's an artifact of Spark
>>>>>>>> introducing read options that aren't a normal part of SQL. Seems 
>>>>>>>> reasonable
>>>>>>>> to me to pass them through a hint.
>>>>>>>>
>>>>>>>> On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Interesting.
>>>>>>>>>
>>>>>>>>> What is this going to add on top of support for Apache Iceberg
>>>>>>>>> <https://www.dremio.com/data-lake/apache-iceberg/>? Will it be in
>>>>>>>>> line with support for Hive ACID tables or Delta Lake?
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 15 Nov 2021 at 01:56, Zhun Wang <wangzhun6...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi dev,
>>>>>>>>>>
>>>>>>>>>> We are discussing Support Dynamic Table Options for Spark SQL (
>>>>>>>>>> https://github.com/apache/spark/pull/34072). We are currently not
>>>>>>>>>> sure if the syntax makes sense and would like to know if there is
>>>>>>>>>> other feedback or opinion on this.
>>>>>>>>>>
>>>>>>>>>> I would appreciate any feedback.
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>
> --
> Ryan Blue
> Tabular
>
