Re: Supports Dynamic Table Options for Spark SQL

2021-11-18 Thread Mich Talebzadeh
OK, on this let us dig a bit deeper, focusing on time travel queries (TTQ).

The interest is in returning data as it appeared at a specific time. So the
discussion is now about how to enable this.

We can specify this by using a placeholder such as 'AS OF SYSTEM TIME'
after the table name in the FROM clause:

SELECT * FROM t AS OF SYSTEM TIME '2021-11-18 18:45:00'

Meaning: return all the rows of table 't' as they appeared at that time.

This syntax makes it fairly standard, in line with the existing databases
that support TTQ.
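As an aside, the intended AS OF resolution can be sketched in a few lines of Python. This is a toy model for illustration only, not Spark code: given a list of (commit time, snapshot id) pairs, an AS OF query resolves to the latest snapshot committed at or before the requested time.

```python
import bisect
from datetime import datetime

def snapshot_as_of(commits, ts):
    """Toy model of AS OF resolution: return the id of the latest
    snapshot committed at or before ts, or None if none exists.
    `commits` is a list of (datetime, snapshot_id) sorted by time."""
    times = [t for t, _ in commits]
    i = bisect.bisect_right(times, ts) - 1  # rightmost commit <= ts
    return commits[i][1] if i >= 0 else None

commits = [
    (datetime(2021, 11, 18, 18, 0), "s1"),
    (datetime(2021, 11, 18, 18, 30), "s2"),
    (datetime(2021, 11, 18, 19, 0), "s3"),
]
```

Under this model, AS OF SYSTEM TIME '2021-11-18 18:45:00' resolves to snapshot "s2".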

HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 17 Nov 2021 at 06:08, Wenchen Fan  wrote:

> It's useful to have a SQL API to specify table options, similar to the
> DataFrameReader API. However, I share the same concern as @Hyukjin Kwon and
> am not very comfortable with using hints to do it.
>
> In the PR, someone mentioned TVF. I think it's better than hints, but
> still has problems. For example, shall we support `FROM read(t AS VERSION
> OF 1, options...)`?
>
> We probably should investigate if there are similar SQL syntaxes in other
> databases first.
>
> On Wed, Nov 17, 2021 at 2:39 AM Mich Talebzadeh 
> wrote:
>
>> This concept is somewhat explained here. If this is true, why can't we just
>> use:
>>
>> SELECT * FROM <table> VERSION AS OF <version>
>>
>>
>> On Tue, 16 Nov 2021 at 17:49, Ryan Blue  wrote:
>>
>>> Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS
>>> OF syntax.
>>>
>>> On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 As I stated before, hints are designed to direct the optimizer to
 choose a certain query execution plan based on the specific criteria.


 -- time travel
 SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */


 The alternative would be to specify time travel by creating a snapshot
 based on CURRENT_DATE() range which encapsulates time travel for
 'snapshot-id'='10963874102873L'


 CREATE SNAPSHOT t_snap

   START WITH CURRENT_DATE() - 30

   NEXT CURRENT_DATE()

   AS SELECT * FROM t


 SELECT * FROM t_snap


 HTH


 On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon  wrote:

> My biggest concern with the syntax in hints is that Spark SQL's
> options can change results (e.g., CSV's header options) whereas hints are
> generally not designed to affect the external results if I am not 
> mistaken.
> This is counterintuitive.
> I left the comment in the PR but what's the real benefit over
> leveraging: SET conf and RESET conf? we can extract options from runtime
> session configurations e.g., SessionConfigSupport.
>
> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Side note about time travel: There is a PR
>>  to add
>> VERSION/TIMESTAMP AS OF syntax to Spark SQL.
>>
>> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>>
>>> I want to note that I wouldn't recommend time traveling this way by
>>> using the hint for `snapshot-id`. Instead, we want to add the standard 
>>> SQL
>>> syntax for that in a separate PR. This is useful for other options that
>>> help a table scan perform better, like specifying the target split size.
>>>
>>> You're right that this isn't a typical optimizer hint, but I'm not
>>> sure what other syntax is possible for this use case. How else would we
>>> send custom

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Wenchen Fan
It's useful to have a SQL API to specify table options, similar to the
DataFrameReader API. However, I share the same concern as @Hyukjin Kwon and
am not very comfortable with using hints to do it.

In the PR, someone mentioned TVF. I think it's better than hints, but still
has problems. For example, shall we support `FROM read(t AS VERSION OF 1,
options...)`?

We probably should investigate if there are similar SQL syntaxes in other
databases first.
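To make the composition question concrete: a TVF-style reader would have to take the table, an optional version, and arbitrary options in a single call. A hypothetical Python analogue (all names invented for illustration; this is not a proposed API):

```python
def read(table, version=None, **options):
    """Hypothetical TVF-style reader: time travel and per-scan options
    both become arguments, so a form like `FROM read(t AS VERSION OF 1,
    options...)` would need the parser to accept a time-travel clause
    inside a function call. Returns the resolved scan description."""
    return {"table": table, "version": version, "options": options}

scan = read("t", version=1, fetchSize="1000")
```

The awkwardness Wenchen points at is visible even in the toy: the version clause and the option list are different kinds of arguments squeezed into one call site.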

On Wed, Nov 17, 2021 at 2:39 AM Mich Talebzadeh 
wrote:

> This concept is somewhat explained here. If this is true, why can't we just
> use:
>
> SELECT * FROM <table> VERSION AS OF <version>
>
>
> On Tue, 16 Nov 2021 at 17:49, Ryan Blue  wrote:
>
>> Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS
>> OF syntax.
>>
>> On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> As I stated before, hints are designed to direct the optimizer to
>>> choose a certain query execution plan based on the specific criteria.
>>>
>>>
>>> -- time travel
>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>
>>>
>>> The alternative would be to specify time travel by creating a snapshot
>>> based on CURRENT_DATE() range which encapsulates time travel for
>>> 'snapshot-id'='10963874102873L'
>>>
>>>
>>> CREATE SNAPSHOT t_snap
>>>
>>>   START WITH CURRENT_DATE() - 30
>>>
>>>   NEXT CURRENT_DATE()
>>>
>>>   AS SELECT * FROM t
>>>
>>>
>>> SELECT * FROM t_snap
>>>
>>>
>>> HTH
>>>
>>>
>>> On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon  wrote:
>>>
 My biggest concern with the syntax in hints is that Spark SQL's options
 can change results (e.g., CSV's header options) whereas hints are generally
 not designed to affect the external results if I am not mistaken. This is
 counterintuitive.
 I left the comment in the PR but what's the real benefit over
 leveraging: SET conf and RESET conf? we can extract options from runtime
 session configurations e.g., SessionConfigSupport.

 On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Side note about time travel: There is a PR
>  to add VERSION/TIMESTAMP
> AS OF syntax to Spark SQL.
>
> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>
>> I want to note that I wouldn't recommend time traveling this way by
>> using the hint for `snapshot-id`. Instead, we want to add the standard 
>> SQL
>> syntax for that in a separate PR. This is useful for other options that
>> help a table scan perform better, like specifying the target split size.
>>
>> You're right that this isn't a typical optimizer hint, but I'm not
>> sure what other syntax is possible for this use case. How else would we
>> send custom properties through to the scan?
>>
>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I am looking at the hint and it appears to me (I stand corrected),
>>> it is a single table hint as below:
>>>
>>> -- time travel
>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>
>>> My assumption is that any view on this table will also benefit from
>>> this hint. This is not a hint to optimizer in a classical sense. Only a
>>> snapshot hint. Normally, a hint is an instruction to the optimizer.
>>> When writing SQL, one may know information about the data unknown to the
>>> optimizer. Hints enable one to make decisions normally made by the
>>> optimizer, sometimes causing the optimizer to select a plan that it 
>>> sees as
>>> higher cost.
>>>
>>>
>>> So far as this case is concerned, it looks OK and I concur it should
>>> be extended to write as well.
>>>
>>>
>>> HTH
>>>
>>>

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Mich Talebzadeh
This concept is somewhat explained here. If this is true, why can't we just
use:

SELECT * FROM <table> VERSION AS OF <version>


On Tue, 16 Nov 2021 at 17:49, Ryan Blue  wrote:

> Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS
> OF syntax.
>
> On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> As I stated before, hints are designed to direct the optimizer to choose
>> a certain query execution plan based on the specific criteria.
>>
>>
>> -- time travel
>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>
>>
>> The alternative would be to specify time travel by creating a snapshot
>> based on CURRENT_DATE() range which encapsulates time travel for
>> 'snapshot-id'='10963874102873L'
>>
>>
>> CREATE SNAPSHOT t_snap
>>
>>   START WITH CURRENT_DATE() - 30
>>
>>   NEXT CURRENT_DATE()
>>
>>   AS SELECT * FROM t
>>
>>
>> SELECT * FROM t_snap
>>
>>
>> HTH
>>
>>
>> On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon  wrote:
>>
>>> My biggest concern with the syntax in hints is that Spark SQL's options
>>> can change results (e.g., CSV's header options) whereas hints are generally
>>> not designed to affect the external results if I am not mistaken. This is
>>> counterintuitive.
>>> I left the comment in the PR but what's the real benefit over
>>> leveraging: SET conf and RESET conf? we can extract options from runtime
>>> session configurations e.g., SessionConfigSupport.
>>>
>>> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Side note about time travel: There is a PR
  to add VERSION/TIMESTAMP
 AS OF syntax to Spark SQL.

 On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:

> I want to note that I wouldn't recommend time traveling this way by
> using the hint for `snapshot-id`. Instead, we want to add the standard SQL
> syntax for that in a separate PR. This is useful for other options that
> help a table scan perform better, like specifying the target split size.
>
> You're right that this isn't a typical optimizer hint, but I'm not
> sure what other syntax is possible for this use case. How else would we
> send custom properties through to the scan?
>
> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> I am looking at the hint and it appears to me (I stand corrected), it
>> is a single table hint as below:
>>
>> -- time travel
>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>
>> My assumption is that any view on this table will also benefit from
>> this hint. This is not a hint to optimizer in a classical sense. Only a
>> snapshot hint. Normally, a hint is an instruction to the optimizer.
>> When writing SQL, one may know information about the data unknown to the
>> optimizer. Hints enable one to make decisions normally made by the
>> optimizer, sometimes causing the optimizer to select a plan that it sees 
>> as
>> higher cost.
>>
>>
>> So far as this case is concerned, it looks OK and I concur it should
>> be extended to write as well.
>>
>>
>> HTH
>>
>>
>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I think since we probably will end up using this same syntax on
>>> write, this makes a lot of sense.

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Ryan Blue
Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS OF
syntax.

On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh 
wrote:

> As I stated before, hints are designed to direct the optimizer to choose
> a certain query execution plan based on the specific criteria.
>
>
> -- time travel
> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>
>
> The alternative would be to specify time travel by creating a snapshot
> based on CURRENT_DATE() range which encapsulates time travel for
> 'snapshot-id'='10963874102873L'
>
>
> CREATE SNAPSHOT t_snap
>
>   START WITH CURRENT_DATE() - 30
>
>   NEXT CURRENT_DATE()
>
>   AS SELECT * FROM t
>
>
> SELECT * FROM t_snap
>
>
> HTH
>
>
> On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon  wrote:
>
>> My biggest concern with the syntax in hints is that Spark SQL's options
>> can change results (e.g., CSV's header options) whereas hints are generally
>> not designed to affect the external results if I am not mistaken. This is
>> counterintuitive.
>> I left the comment in the PR but what's the real benefit over leveraging:
>> SET conf and RESET conf? we can extract options from runtime session
>> configurations e.g., SessionConfigSupport.
>>
>> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Side note about time travel: There is a PR
>>>  to add VERSION/TIMESTAMP
>>> AS OF syntax to Spark SQL.
>>>
>>> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>>>
 I want to note that I wouldn't recommend time traveling this way by
 using the hint for `snapshot-id`. Instead, we want to add the standard SQL
 syntax for that in a separate PR. This is useful for other options that
 help a table scan perform better, like specifying the target split size.

 You're right that this isn't a typical optimizer hint, but I'm not sure
 what other syntax is possible for this use case. How else would we send
 custom properties through to the scan?

 On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> I am looking at the hint and it appears to me (I stand corrected), it
> is a single table hint as below:
>
> -- time travel
> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>
> My assumption is that any view on this table will also benefit from
> this hint. This is not a hint to optimizer in a classical sense. Only a
> snapshot hint. Normally, a hint is an instruction to the optimizer.
> When writing SQL, one may know information about the data unknown to the
> optimizer. Hints enable one to make decisions normally made by the
> optimizer, sometimes causing the optimizer to select a plan that it sees 
> as
> higher cost.
>
>
> So far as this case is concerned, it looks OK and I concur it should
> be extended to write as well.
>
>
> HTH
>
>
> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> I think since we probably will end up using this same syntax on
>> write, this makes a lot of sense. Unless there is another good way to
>> express a similar concept during a write operation I think going forward
>> with this would be ok.
>>
>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:
>>
>>> The proposed feature is to be able to pass options through SQL like
>>> you would when using the DataFrameReader API, so it would work for
>>> all sources that support read options. Read options are part of the DSv2
>>> API, there just isn’t a way to pass options when using SQL. The PR also 
>>> has
>>> a non-Iceberg example, which is being able to customize some JDBC source
>>> behaviors per query (e.g., fetchSize), rather than globally in the 
>>> table’s
>>> options.
>>>
>>> The proposed syntax is odd, but I think that's an artifact of Spark
>>> introducing read options that aren't a normal part of SQL. Seems reasonable
>>> to me to pass them through a hint.

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Mich Talebzadeh
As I stated before, hints are designed to direct the optimizer to choose a
certain query execution plan based on specific criteria.


-- time travel
SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
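For illustration, the key/value payload such a hint carries can be extracted with a few lines of Python. This is a toy parser to show what the hint contains, not how Spark's analyzer actually handles hints:

```python
import re

def parse_options_hint(sql):
    """Toy parser: pull the key/value pairs out of an OPTIONS hint
    embedded in a SQL string. Returns {} if no hint is present."""
    m = re.search(r"/\*\+\s*OPTIONS\((.*?)\)\s*\*/", sql)
    if not m:
        return {}
    # pairs look like 'key'='value'
    return dict(re.findall(r"'([^']+)'\s*=\s*'([^']+)'", m.group(1)))

query = "SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */"
# parse_options_hint(query) yields {'snapshot-id': '10963874102873L'}
```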


The alternative would be to specify time travel by creating a snapshot
based on a CURRENT_DATE() range that encapsulates time travel for
'snapshot-id'='10963874102873L':


CREATE SNAPSHOT t_snap

  START WITH CURRENT_DATE() - 30

  NEXT CURRENT_DATE()

  AS SELECT * FROM t


SELECT * FROM t_snap


HTH


On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon  wrote:

> My biggest concern with the syntax in hints is that Spark SQL's options
> can change results (e.g., CSV's header options) whereas hints are generally
> not designed to affect the external results if I am not mistaken. This is
> counterintuitive.
> I left the comment in the PR but what's the real benefit over leveraging:
> SET conf and RESET conf? we can extract options from runtime session
> configurations e.g., SessionConfigSupport.
>
> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas 
> wrote:
>
>> Side note about time travel: There is a PR
>>  to add VERSION/TIMESTAMP AS
>> OF syntax to Spark SQL.
>>
>> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>>
>>> I want to note that I wouldn't recommend time traveling this way by
>>> using the hint for `snapshot-id`. Instead, we want to add the standard SQL
>>> syntax for that in a separate PR. This is useful for other options that
>>> help a table scan perform better, like specifying the target split size.
>>>
>>> You're right that this isn't a typical optimizer hint, but I'm not sure
>>> what other syntax is possible for this use case. How else would we send
>>> custom properties through to the scan?
>>>
>>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 I am looking at the hint and it appears to me (I stand corrected), it
 is a single table hint as below:

 -- time travel
 SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */

 My assumption is that any view on this table will also benefit from
 this hint. This is not a hint to optimizer in a classical sense. Only a
 snapshot hint. Normally, a hint is an instruction to the optimizer.
 When writing SQL, one may know information about the data unknown to the
 optimizer. Hints enable one to make decisions normally made by the
 optimizer, sometimes causing the optimizer to select a plan that it sees as
 higher cost.


 So far as this case is concerned, it looks OK and I concur it should be
 extended to write as well.


 HTH


 On Mon, 15 Nov 2021 at 17:02, Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> I think since we probably will end up using this same syntax on write,
> this makes a lot of sense. Unless there is another good way to express a
> similar concept during a write operation I think going forward with this
> would be ok.
>
> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:
>
>> The proposed feature is to be able to pass options through SQL like
>> you would when using the DataFrameReader API, so it would work for
>> all sources that support read options. Read options are part of the DSv2
>> API, there just isn’t a way to pass options when using SQL. The PR also 
>> has
>> a non-Iceberg example, which is being able to customize some JDBC source
>> behaviors per query (e.g., fetchSize), rather than globally in the 
>> table’s
>> options.
>>
>> The proposed syntax is odd, but I think that's an artifact of Spark
>> introducing read options that aren't a normal part of SQL. Seems 
>> reasonable
>> to me to pass them through a hint.
>>
>> On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Interesting.
>>>
>>> What is this going to add on top of support for Apache Iceberg? Will it be
>>> in line with support for Hive ACID tables or Delta Lake?

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Hyukjin Kwon
My biggest concern with the syntax in hints is that Spark SQL's options can
change results (e.g., CSV's header option) whereas hints are generally not
designed to affect query results, if I am not mistaken. This is
counterintuitive.
I left a comment in the PR, but what's the real benefit over leveraging SET
conf and RESET conf? We can extract options from runtime session
configurations, e.g., via SessionConfigSupport.
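A rough sketch of what that alternative looks like. The conf-key shape below follows the SessionConfigSupport convention as I understand it (session configs of the form spark.datasource.<keyPrefix>.<key> become per-source options); treat the exact prefix and the fetchSize key as assumptions for illustration:

```python
def extract_options(session_conf, key_prefix):
    """Sketch of SessionConfigSupport-style extraction: pick out the
    session configs under spark.datasource.<key_prefix>. and return
    them as per-source read options (prefix convention assumed)."""
    head = f"spark.datasource.{key_prefix}."
    return {k[len(head):]: v
            for k, v in session_conf.items() if k.startswith(head)}

conf = {
    "spark.datasource.jdbc.fetchSize": "1000",  # e.g., set via SET ...
    "spark.sql.shuffle.partitions": "200",      # unrelated conf, ignored
}
```

The user-facing flow would then be SET the conf, run the query, RESET, with no hint syntax involved.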

On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas 
wrote:

> Side note about time travel: There is a PR
>  to add VERSION/TIMESTAMP AS
> OF syntax to Spark SQL.
>
> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>
>> I want to note that I wouldn't recommend time traveling this way by using
>> the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
>> for that in a separate PR. This is useful for other options that help a
>> table scan perform better, like specifying the target split size.
>>
>> You're right that this isn't a typical optimizer hint, but I'm not sure
>> what other syntax is possible for this use case. How else would we send
>> custom properties through to the scan?
>>
>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I am looking at the hint and it appears to me (I stand corrected), it is
>>> a single table hint as below:
>>>
>>> -- time travel
>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>
>>> My assumption is that any view on this table will also benefit from this
>>> hint. This is not a hint to optimizer in a classical sense. Only a snapshot
>>> hint. Normally, a hint is an instruction to the optimizer. When writing
>>> SQL, one may know information about the data unknown to the optimizer.
>>> Hints enable one to make decisions normally made by the optimizer,
>>> sometimes causing the optimizer to select a plan that it sees as higher
>>> cost.
>>>
>>>
>>> So far as this case is concerned, it looks OK and I concur it should be
>>> extended to write as well.
>>>
>>>
>>> HTH
>>>
>>>
>>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer 
>>> wrote:
>>>
 I think since we probably will end up using this same syntax on write,
 this makes a lot of sense. Unless there is another good way to express a
 similar concept during a write operation I think going forward with this
 would be ok.

 On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:

> The proposed feature is to be able to pass options through SQL like
> you would when using the DataFrameReader API, so it would work for
> all sources that support read options. Read options are part of the DSv2
> API, there just isn’t a way to pass options when using SQL. The PR also 
> has
> a non-Iceberg example, which is being able to customize some JDBC source
> behaviors per query (e.g., fetchSize), rather than globally in the table’s
> options.
>
> The proposed syntax is odd, but I think that's an artifact of Spark
> introducing read options that aren't a normal part of SQL. Seems 
> reasonable
> to me to pass them through a hint.
>
> On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Interesting.
>>
>> What is this going to add on top of support for Apache Iceberg? Will it be
>> in line with support for Hive ACID tables or Delta Lake?
>>
>> HTH
>>
>>
>>
>> On Mon, 15 Nov 2021 at 01:56, Zhun Wang 
>> wrote:
>>
>>> Hi dev,
>>>
>>> We are discussing Support Dynamic Table Options for Spark SQL (
>>> https://github.com/apache/spark/pull/34072). It is currently not
>>> sure if the syntax makes sense, and would like to know if there is other
>>> feedback or opinion on this.
>>>
>>> I would appreciate any feedback on this.
>>>
>>> Thanks.
>>>
>>
>

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Nicholas Chammas
Side note about time travel: there is a PR to add VERSION/TIMESTAMP AS OF
syntax to Spark SQL.

On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:

> I want to note that I wouldn't recommend time traveling this way by using
> the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
> for that in a separate PR. This is useful for other options that help a
> table scan perform better, like specifying the target split size.
>
> You're right that this isn't a typical optimizer hint, but I'm not sure
> what other syntax is possible for this use case. How else would we send
> custom properties through to the scan?
>
> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh 
> wrote:
>
>> I am looking at the hint and it appears to me (I stand corrected), it is
>> a single table hint as below:
>>
>> -- time travel
>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>
>> My assumption is that any view on this table will also benefit from this
>> hint. This is not a hint to optimizer in a classical sense. Only a snapshot
>> hint. Normally, a hint is an instruction to the optimizer. When writing
>> SQL, one may know information about the data unknown to the optimizer.
>> Hints enable one to make decisions normally made by the optimizer,
>> sometimes causing the optimizer to select a plan that it sees as higher
>> cost.
>>
>>
>> So far as this case is concerned, it looks OK and I concur it should be
>> extended to write as well.
>>
>>
>> HTH
>>
>>
>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer 
>> wrote:
>>
>>> I think since we probably will end up using this same syntax on write,
>>> this makes a lot of sense. Unless there is another good way to express a
>>> similar concept during a write operation I think going forward with this
>>> would be ok.
>>>
>>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:
>>>
 The proposed feature is to be able to pass options through SQL like you
 would when using the DataFrameReader API, so it would work for all
 sources that support read options. Read options are part of the DSv2 API,
 there just isn’t a way to pass options when using SQL. The PR also has a
 non-Iceberg example, which is being able to customize some JDBC source
 behaviors per query (e.g., fetchSize), rather than globally in the table’s
 options.

 The proposed syntax is odd, but I think that's an artifact of Spark
 introducing read options that aren't a normal part of SQL. Seems reasonable
 to me to pass them through a hint.

 On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Interesting.
>
> What is this going to add on top of support for Apache Iceberg? Will it be
> in line with support for Hive ACID tables or Delta Lake?
>
> HTH
>
>
>
> On Mon, 15 Nov 2021 at 01:56, Zhun Wang 
> wrote:
>
>> Hi dev,
>>
>> We are discussing Support Dynamic Table Options for Spark SQL (
>> https://github.com/apache/spark/pull/34072). It is currently not certain
>> whether the syntax makes sense, and we would like to know if there is other
>> feedback or opinion on this.
>>
>> I would appreciate any feedback on this.
>>
>> Thanks.
>>
>

 --
 Ryan Blue
 Tabular

>>>
>
> --
> Ryan Blue
> Tabular
>


Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
I want to note that I wouldn't recommend time traveling this way by using
the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
for that in a separate PR. This is useful for other options that help a
table scan perform better, like specifying the target split size.

You're right that this isn't a typical optimizer hint, but I'm not sure
what other syntax is possible for this use case. How else would we send
custom properties through to the scan?
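As an example of the kind of scan-tuning option meant here, a target split size simply controls how a file is chunked into read tasks. A minimal sketch in Python (toy code; 'split-size' is an illustrative option name, not an official one):

```python
def plan_splits(file_length, options):
    """Toy scan planner: chunk a file of file_length bytes into
    (offset, length) splits using a 'split-size' read option,
    defaulting to 128 MB when the option is absent."""
    size = int(options.get("split-size", 128 * 1024 * 1024))
    return [(off, min(size, file_length - off))
            for off in range(0, file_length, size)]

splits = plan_splits(300, {"split-size": "128"})  # 128 + 128 + 44 bytes
```

A smaller split size means more, smaller read tasks for the same data, which is exactly the per-query tuning knob that has no SQL surface today.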


-- 
Ryan Blue
Tabular


Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Mich Talebzadeh
I am looking at the hint and it appears to me (I stand to be corrected) that
it is a single-table hint, as below:

-- time travel
SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */

My assumption is that any view on this table will also benefit from this
hint. This is not a hint to the optimizer in the classical sense, only a
snapshot hint. Normally, a hint is an instruction to the optimizer: when
writing SQL, one may know information about the data that is unknown to the
optimizer, and hints let one make decisions normally made by the optimizer,
sometimes causing it to select a plan that it sees as higher cost.
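That reading matches how the syntax behaves mechanically: the OPTIONS "hint" is just a bag of key/value pairs attached to one table reference, not a planner directive. A toy extraction of those pairs from the comment, purely illustrative (Spark handles this in its SQL grammar, not with a regex), looks like:

```python
import re

def parse_options_hint(sql: str) -> dict:
    """Pull key/value pairs out of a /*+ OPTIONS(...) */ comment.
    Illustrative regex sketch only; not Spark's actual parser."""
    m = re.search(r"/\*\+\s*OPTIONS\(([^)]*)\)\s*\*/", sql)
    if not m:
        return {}
    return dict(re.findall(r"'([^']+)'\s*=\s*'([^']+)'", m.group(1)))

query = "SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */"
```

The result is a plain property map handed to the scan, which is why the hint affects what data is read rather than which plan the optimizer picks.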


So far as this case is concerned, it looks OK, and I concur that it should be
extended to writes as well.


HTH








Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Russell Spitzer
I think since we will probably end up using this same syntax on write, this
makes a lot of sense. Unless there is another good way to express a similar
concept during a write operation, I think going forward with this would be
OK.



Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
The proposed feature is to be able to pass options through SQL like you
would when using the DataFrameReader API, so it would work for all sources
that support read options. Read options are part of the DSv2 API, there
just isn’t a way to pass options when using SQL. The PR also has a
non-Iceberg example, which is being able to customize some JDBC source
behaviors per query (e.g., fetchSize), rather than globally in the table’s
options.

The proposed syntax is odd, but I think that's an artifact of Spark
introducing read options that aren't a normal part of SQL. Seems reasonable
to me to pass them through a hint.
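The intended semantics here — per-query options layered over the table's stored options — can be sketched in a few lines. The option names and the merge-by-override behavior below are assumptions for illustration, not Spark's actual DSv2 code path:

```python
def effective_options(table_options: dict, query_options: dict) -> dict:
    """Per-query (dynamic) options override table-level options with the same key."""
    merged = dict(table_options)
    merged.update(query_options)
    return merged

# Table-level JDBC options as they might be stored in a catalog (invented values).
table_opts = {"url": "jdbc:postgresql://db/sales", "fetchSize": "100"}
# Per-query override, e.g. a larger fetch size for one big scan.
query_opts = {"fetchSize": "10000"}
```

So a query-level `fetchSize` would win for that one scan, while every other query against the table keeps the stored default.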


-- 
Ryan Blue
Tabular


Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Mich Talebzadeh
Interesting.

What is this going to add on top of the existing support for Apache Iceberg?
Will it be in line with support for Hive ACID tables or Delta Lake?

HTH









Supports Dynamic Table Options for Spark SQL

2021-11-14 Thread Zhun Wang
Hi dev,

We are discussing Support for Dynamic Table Options in Spark SQL (
https://github.com/apache/spark/pull/34072). We are not yet sure whether the
syntax makes sense, and would appreciate any other feedback or opinions on
this.

Thanks.