My biggest concern with the hint syntax is that Spark SQL's options can
change results (e.g., CSV's header option), whereas hints, if I am not
mistaken, are generally not designed to affect external results. This is
counterintuitive.
I left a comment in the PR, but what's the real benefit over leveraging
SET conf and RESET conf? We can extract options from runtime session
configurations, e.g., via SessionConfigSupport.
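For illustration, the SET/RESET alternative could look like the following, assuming a source that implements SessionConfigSupport picks up session configurations under its key prefix. The conf and table names here are hypothetical:

```sql
-- Set a per-session read option via a runtime conf, run the query, then
-- restore the default. Spark maps spark.datasource.<keyPrefix>.<key>
-- session confs to read options for sources implementing
-- SessionConfigSupport; the exact conf name below is illustrative.
SET spark.datasource.jdbc.fetchSize=1000;
SELECT * FROM jdbc_table;
RESET spark.datasource.jdbc.fetchSize;
```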

On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Side note about time travel: There is a PR
> <https://github.com/apache/spark/pull/34497> to add VERSION/TIMESTAMP AS
> OF syntax to Spark SQL.
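For reference, the time-travel syntax proposed there follows the standard SQL form and would look roughly like this (illustrative; the exact grammar depends on the final shape of that PR):

```sql
-- Standard time-travel clauses as proposed; details may differ on merge.
SELECT * FROM t VERSION AS OF 10963874102873;
SELECT * FROM t TIMESTAMP AS OF '2021-11-15 00:00:00';
```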
>
> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue <b...@tabular.io> wrote:
>
>> I want to note that I wouldn't recommend time traveling this way by using
>> the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
>> for that in a separate PR. This is useful for other options that help a
>> table scan perform better, like specifying the target split size.
>>
>> You're right that this isn't a typical optimizer hint, but I'm not sure
>> what other syntax is possible for this use case. How else would we send
>> custom properties through to the scan?
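For example, a scan-tuning property like the target split size could be passed through the proposed hint. The option name below follows Iceberg's read options but is shown only for illustration:

```sql
-- Illustrative: pass a per-query target split size (bytes) to the scan;
-- the option name depends on the source's supported read options.
SELECT * FROM t /*+ OPTIONS('split-size'='134217728') */
```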
>>
>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I am looking at the hint and it appears to me (I stand to be corrected)
>>> that it is a single-table hint, as below:
>>>
>>> -- time travel
>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>
>>> My assumption is that any view on this table will also benefit from this
>>> hint. This is not a hint to the optimizer in the classical sense; it is
>>> only a snapshot hint. Normally, a hint is an instruction to the optimizer.
>>> When writing SQL, one may know information about the data that is unknown
>>> to the optimizer. Hints enable one to make decisions normally made by the
>>> optimizer, sometimes causing it to select a plan that it sees as higher
>>> cost.
>>>
>>>
>>> So far as this case is concerned, it looks OK and I concur it should be
>>> extended to write as well.
>>>
>>>
>>> HTH
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer <russell.spit...@gmail.com>
>>> wrote:
>>>
>>>> I think since we probably will end up using this same syntax on write,
>>>> this makes a lot of sense. Unless there is another good way to express a
>>>> similar concept during a write operation I think going forward with this
>>>> would be ok.
>>>>
>>>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> The proposed feature is to be able to pass options through SQL like
>>>>> you would when using the DataFrameReader API, so it would work for
>>>>> all sources that support read options. Read options are part of the DSv2
>>>>> API, there just isn’t a way to pass options when using SQL. The PR also 
>>>>> has
>>>>> a non-Iceberg example, which is being able to customize some JDBC source
>>>>> behaviors per query (e.g., fetchSize), rather than globally in the table’s
>>>>> options.
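The JDBC case mentioned above would look roughly like this under the proposed hint syntax (the table name is hypothetical; fetchSize follows the JDBC source's read options):

```sql
-- Per-query fetch size for a JDBC-backed table, instead of setting it
-- globally in the table's options; syntax per the proposed hint.
SELECT * FROM jdbc_table /*+ OPTIONS('fetchSize'='1000') */
```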
>>>>>
>>>>> The proposed syntax is odd, but I think that's an artifact of Spark
>>>>> introducing read options that aren't a normal part of SQL. Seems 
>>>>> reasonable
>>>>> to me to pass them through a hint.
>>>>>
>>>>> On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Interesting.
>>>>>>
>>>>>> What is this going to add on top of the existing support for Apache
>>>>>> Iceberg <https://www.dremio.com/data-lake/apache-iceberg/>? Will it be
>>>>>> in line with the support for Hive ACID tables or Delta Lake?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 15 Nov 2021 at 01:56, Zhun Wang <wangzhun6...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> We are discussing Support Dynamic Table Options for Spark SQL (
>>>>>>> https://github.com/apache/spark/pull/34072). We are not yet sure
>>>>>>> whether the syntax makes sense, and would like to know if there is
>>>>>>> other feedback or opinion on this.
>>>>>>>
>>>>>>> I would appreciate any feedback on this.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
