Re: [DISCUSS] PostgreSQL dialect

2019-12-04 Thread Yuanjian Li
Thanks all of you for joining the discussion.
The PR is at https://github.com/apache/spark/pull/26763, and all the
PostgreSQL dialect related PRs are linked in the description.
I hope the authors can help with the reviews.

Best,
Yuanjian

Driesprong, Fokko wrote on Sun, Dec 1, 2019 at 7:24 PM:

> +1 (non-binding)
>
> Cheers, Fokko
>
> On Thu, Nov 28, 2019 at 03:47, Dongjoon Hyun wrote:
>
>> +1
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Yea, +1, that looks pretty reasonable to me.
>>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>> from the codebase before it's too late. Currently we only have 3 features
>>> under PostgreSQL dialect:
>>> I personally think we could at least stop work on the dialect until
>>> 3.0 is released.
>>>
>>>
>>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>>> gengliang.w...@databricks.com> wrote:
>>>
 +1 with the practical proposal.
 To me, the major concern is that the code base becomes complicated,
 while the PostgreSQL dialect has very limited features. I tried introducing
 one big flag `spark.sql.dialect` and isolating related code in #25697,
 but it seems hard to be clean.
 Furthermore, the PostgreSQL dialect configuration overlaps with the
 ANSI mode, which can be confusing sometimes.

 Gengliang

 On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:

> +1
>
>
>> One particular negative effect has been that new postgresql tests add
>> well over an hour to tests,
>
>
> Adding postgresql tests is for improving the test coverage of Spark
> SQL. We should continue to do this by importing more test cases. The
> quality of Spark highly depends on the test coverage. We can further
> parallelize the test execution to reduce the test time.
>
> Migrating PostgreSQL workloads to Spark SQL
>
>
> This should not be our current focus. In the near future, it is
> impossible to be fully compatible with PostgreSQL. We should focus on
> adding features that are useful to the Spark community. PostgreSQL is a good
> reference, but we do not need to blindly follow it. We already closed
> multiple related JIRAs that try to add some PostgreSQL features that are
> not commonly used.
>
> Cheers,
>
> Xiao
>
>
> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> I think it is important to distinguish between two different concepts:
>>
>>- Adherence to standards and their well established
>>implementations.
>>- Enabling migrations from some product X to Spark.
>>
>> While these two problems are related, they are independent and one
>> can be achieved without the other.
>>
>>    - The former approach doesn't imply that all features of the SQL
>>    standard (or its specific implementation) are provided. It is sufficient
>>    that the commonly used features that are implemented are standard
>>    compliant. Therefore, if an end user applies some well-known pattern,
>>    things will work as expected.
>>
>>    In my personal opinion that's something that is worth the required
>>    development resources, and in general should happen within the project.
>>
>>
>>    - The latter one is more complicated. First of all, the premise that
>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>>    While both Spark and PostgreSQL evolve, and probably have more in common
>>    today than a few years ago, they're not even close enough to pretend
>>    that one can be a replacement for the other. In contrast, existing
>>    compatibility layers between major vendors make sense, because feature
>>    disparity (at least when it comes to core functionality) is usually
>>    minimal. And that doesn't even touch the problem that PostgreSQL
>>    provides extensively used extension points that enable a broad and
>>    evolving ecosystem (what should we do about continuous queries? Should
>>    Structured Streaming provide some compatibility layer as well?).
>>
>>    More realistically, Spark could provide a compatibility layer with some
>>    analytical tools that themselves provide some PostgreSQL compatibility,
>>    but these are not always fully compatible with upstream PostgreSQL, nor
>>    do they necessarily follow the latest PostgreSQL development.
>>
>>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>>    availability of required primitives), maintained as a separate project,
>>    without putting more strain on existing resources. Effectively, what we
>>    care about here is whether we can translate a certain SQL string into a
>>    logical or physical plan.

Re: [DISCUSS] PostgreSQL dialect

2019-12-01 Thread Driesprong, Fokko
+1 (non-binding)

Cheers, Fokko

On Thu, Nov 28, 2019 at 03:47, Dongjoon Hyun wrote:

> +1
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro 
> wrote:
>
>> Yea, +1, that looks pretty reasonable to me.
>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>> from the codebase before it's too late. Currently we only have 3 features
>> under PostgreSQL dialect:
>> I personally think we could at least stop work on the dialect until
>> 3.0 is released.
>>
>>
>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> +1 with the practical proposal.
>>> To me, the major concern is that the code base becomes complicated,
>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>> one big flag `spark.sql.dialect` and isolating related code in #25697,
>>> but it seems hard to be clean.
>>> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
>>> mode, which can be confusing sometimes.
>>>
>>> Gengliang
>>>
>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:
>>>
 +1


> One particular negative effect has been that new postgresql tests add
> well over an hour to tests,


 Adding postgresql tests is for improving the test coverage of Spark
 SQL. We should continue to do this by importing more test cases. The
 quality of Spark highly depends on the test coverage. We can further
 parallelize the test execution to reduce the test time.

 Migrating PostgreSQL workloads to Spark SQL


 This should not be our current focus. In the near future, it is
 impossible to be fully compatible with PostgreSQL. We should focus on
 adding features that are useful to the Spark community. PostgreSQL is a good
 reference, but we do not need to blindly follow it. We already closed
 multiple related JIRAs that try to add some PostgreSQL features that are
 not commonly used.

 Cheers,

 Xiao


 On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
 mszymkiew...@gmail.com> wrote:

> I think it is important to distinguish between two different concepts:
>
>- Adherence to standards and their well established
>implementations.
>- Enabling migrations from some product X to Spark.
>
> While these two problems are related, they are independent and one
> can be achieved without the other.
>
>    - The former approach doesn't imply that all features of the SQL
>    standard (or its specific implementation) are provided. It is sufficient
>    that the commonly used features that are implemented are standard
>    compliant. Therefore, if an end user applies some well-known pattern,
>    things will work as expected.
>
>In my personal opinion that's something that is worth the required
>development resources, and in general should happen within the project.
>
>
>    - The latter one is more complicated. First of all, the premise that
>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>    While both Spark and PostgreSQL evolve, and probably have more in common
>    today than a few years ago, they're not even close enough to pretend
>    that one can be a replacement for the other. In contrast, existing
>    compatibility layers between major vendors make sense, because feature
>    disparity (at least when it comes to core functionality) is usually
>    minimal. And that doesn't even touch the problem that PostgreSQL
>    provides extensively used extension points that enable a broad and
>    evolving ecosystem (what should we do about continuous queries? Should
>    Structured Streaming provide some compatibility layer as well?).
>
>    More realistically, Spark could provide a compatibility layer with some
>    analytical tools that themselves provide some PostgreSQL compatibility,
>    but these are not always fully compatible with upstream PostgreSQL, nor
>    do they necessarily follow the latest PostgreSQL development.
>
>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>    availability of required primitives), maintained as a separate project,
>    without putting more strain on existing resources. Effectively, what we
>    care about here is whether we can translate a certain SQL string into a
>    logical or physical plan.
>
>
> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>
> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark
> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This has gone very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.

Re: [DISCUSS] PostgreSQL dialect

2019-11-27 Thread Dongjoon Hyun
+1

Bests,
Dongjoon.

On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro 
wrote:

> Yea, +1, that looks pretty reasonable to me.
> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3 features
> under PostgreSQL dialect:
> I personally think we could at least stop work on the dialect until 3.0
> is released.
>
>
> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> +1 with the practical proposal.
>> To me, the major concern is that the code base becomes complicated, while
>> the PostgreSQL dialect has very limited features. I tried introducing one
>> big flag `spark.sql.dialect` and isolating related code in #25697, but
>> it seems hard to be clean.
>> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
>> mode, which can be confusing sometimes.
>>
>> Gengliang
>>
>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:
>>
>>> +1
>>>
>>>
 One particular negative effect has been that new postgresql tests add
 well over an hour to tests,
>>>
>>>
>>> Adding postgresql tests is for improving the test coverage of Spark SQL.
>>> We should continue to do this by importing more test cases. The quality of
>>> Spark highly depends on the test coverage. We can further parallelize the test
>>> execution to reduce the test time.
>>>
>>> Migrating PostgreSQL workloads to Spark SQL
>>>
>>>
>>> This should not be our current focus. In the near future, it is
>>> impossible to be fully compatible with PostgreSQL. We should focus on
>>> adding features that are useful to the Spark community. PostgreSQL is a good
>>> reference, but we do not need to blindly follow it. We already closed
>>> multiple related JIRAs that try to add some PostgreSQL features that are
>>> not commonly used.
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>>> mszymkiew...@gmail.com> wrote:
>>>
 I think it is important to distinguish between two different concepts:

- Adherence to standards and their well established implementations.
- Enabling migrations from some product X to Spark.

 While these two problems are related, they are independent and one can
 be achieved without the other.

   - The former approach doesn't imply that all features of the SQL
   standard (or its specific implementation) are provided. It is sufficient
   that the commonly used features that are implemented are standard
   compliant. Therefore, if an end user applies some well-known pattern,
   things will work as expected.

In my personal opinion that's something that is worth the required
development resources, and in general should happen within the project.


   - The latter one is more complicated. First of all, the premise that
   one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
   While both Spark and PostgreSQL evolve, and probably have more in common
   today than a few years ago, they're not even close enough to pretend
   that one can be a replacement for the other. In contrast, existing
   compatibility layers between major vendors make sense, because feature
   disparity (at least when it comes to core functionality) is usually
   minimal. And that doesn't even touch the problem that PostgreSQL
   provides extensively used extension points that enable a broad and
   evolving ecosystem (what should we do about continuous queries? Should
   Structured Streaming provide some compatibility layer as well?).

   More realistically, Spark could provide a compatibility layer with some
   analytical tools that themselves provide some PostgreSQL compatibility,
   but these are not always fully compatible with upstream PostgreSQL, nor
   do they necessarily follow the latest PostgreSQL development.

   Furthermore, a compatibility layer can be, within certain limits (i.e.
   availability of required primitives), maintained as a separate project,
   without putting more strain on existing resources. Effectively, what we
   care about here is whether we can translate a certain SQL string into a
   logical or physical plan.


 On 11/26/19 3:26 PM, Wenchen Fan wrote:

 Hi all,

 Recently we started an effort to achieve feature parity between Spark and
 PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764

 This has gone very well. We've added many missing features (parser rules,
 built-in functions, etc.) to Spark, and also corrected several
 inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
 Many thanks to all the people who contributed to it!

 There are several cases when adding a PostgreSQL feature:
 1. Spark doesn't have this feature: just add it.

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Takeshi Yamamuro
Yea, +1, that looks pretty reasonable to me.
> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
from the codebase before it's too late. Currently we only have 3 features
under PostgreSQL dialect:
I personally think we could at least stop work on the dialect until 3.0
is released.


On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
gengliang.w...@databricks.com> wrote:

> +1 with the practical proposal.
> To me, the major concern is that the code base becomes complicated, while
> the PostgreSQL dialect has very limited features. I tried introducing one
> big flag `spark.sql.dialect` and isolating related code in #25697, but
> it seems hard to be clean.
> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
> mode, which can be confusing sometimes.
>
> Gengliang
>
> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:
>
>> +1
>>
>>
>>> One particular negative effect has been that new postgresql tests add
>>> well over an hour to tests,
>>
>>
>> Adding postgresql tests is for improving the test coverage of Spark SQL.
>> We should continue to do this by importing more test cases. The quality of
>> Spark highly depends on the test coverage. We can further parallelize the test
>> execution to reduce the test time.
>>
>> Migrating PostgreSQL workloads to Spark SQL
>>
>>
>> This should not be our current focus. In the near future, it is
>> impossible to be fully compatible with PostgreSQL. We should focus on
>> adding features that are useful to the Spark community. PostgreSQL is a good
>> reference, but we do not need to blindly follow it. We already closed
>> multiple related JIRAs that try to add some PostgreSQL features that are
>> not commonly used.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>> I think it is important to distinguish between two different concepts:
>>>
>>>- Adherence to standards and their well established implementations.
>>>- Enabling migrations from some product X to Spark.
>>>
>>> While these two problems are related, they are independent and one can
>>> be achieved without the other.
>>>
>>>    - The former approach doesn't imply that all features of the SQL
>>>    standard (or its specific implementation) are provided. It is sufficient
>>>    that the commonly used features that are implemented are standard
>>>    compliant. Therefore, if an end user applies some well-known pattern,
>>>    things will work as expected.
>>>
>>>In my personal opinion that's something that is worth the required
>>>development resources, and in general should happen within the project.
>>>
>>>
>>>    - The latter one is more complicated. First of all, the premise that
>>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>>>    While both Spark and PostgreSQL evolve, and probably have more in common
>>>    today than a few years ago, they're not even close enough to pretend
>>>    that one can be a replacement for the other. In contrast, existing
>>>    compatibility layers between major vendors make sense, because feature
>>>    disparity (at least when it comes to core functionality) is usually
>>>    minimal. And that doesn't even touch the problem that PostgreSQL
>>>    provides extensively used extension points that enable a broad and
>>>    evolving ecosystem (what should we do about continuous queries? Should
>>>    Structured Streaming provide some compatibility layer as well?).
>>>
>>>    More realistically, Spark could provide a compatibility layer with some
>>>    analytical tools that themselves provide some PostgreSQL compatibility,
>>>    but these are not always fully compatible with upstream PostgreSQL, nor
>>>    do they necessarily follow the latest PostgreSQL development.
>>>
>>>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>>>    availability of required primitives), maintained as a separate project,
>>>    without putting more strain on existing resources. Effectively, what we
>>>    care about here is whether we can translate a certain SQL string into a
>>>    logical or physical plan.
>>>
>>>
>>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>>
>>> Hi all,
>>>
>>> Recently we started an effort to achieve feature parity between Spark and
>>> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>
>>> This has gone very well. We've added many missing features (parser rules,
>>> built-in functions, etc.) to Spark, and also corrected several
>>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>>> Many thanks to all the people who contributed to it!
>>>
>>> There are several cases when adding a PostgreSQL feature:
>>> 1. Spark doesn't have this feature: just add it.
>>> 2. Spark has this feature, but the behavior is different:
>>> 2.1 Spark's behavior doesn't make sense: change it to follow SQL
>>> standard and PostgreSQL, with a legacy config to restore the behavior.

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Gengliang Wang
+1 with the practical proposal.
To me, the major concern is that the code base becomes complicated, while
the PostgreSQL dialect has very limited features. I tried introducing one
big flag `spark.sql.dialect` and isolating related code in #25697, but it
seems hard to be clean.
Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
mode, which can be confusing sometimes.
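
(A purely illustrative, self-contained Scala sketch of the code-complexity
concern above; this is not the actual #25697 change. A single dialect flag
ends up being consulted in many unrelated places, such as parsing, type
coercion and expression evaluation, so the isolation never stays clean. The
behavior shown restates a point made later in the thread: int / int is
double in Spark but int in PostgreSQL.)

  sealed trait Dialect
  case object SparkNative extends Dialect
  case object PostgreSQL  extends Dialect

  // One of many call sites that would need to branch on the dialect flag.
  def divide(a: Int, b: Int, dialect: Dialect): Any = dialect match {
    case PostgreSQL  => a / b           // PostgreSQL dialect: int / int -> int
    case SparkNative => a.toDouble / b  // Spark default: int / int -> double
  }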

Gengliang

On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:

> +1
>
>
>> One particular negative effect has been that new postgresql tests add
>> well over an hour to tests,
>
>
> Adding postgresql tests is for improving the test coverage of Spark SQL.
> We should continue to do this by importing more test cases. The quality of
> Spark highly depends on the test coverage. We can further parallelize the test
> execution to reduce the test time.
>
> Migrating PostgreSQL workloads to Spark SQL
>
>
> This should not be our current focus. In the near future, it is impossible
> to be fully compatible with PostgreSQL. We should focus on adding features
> that are useful to the Spark community. PostgreSQL is a good reference, but we
> do not need to blindly follow it. We already closed multiple related JIRAs
> that try to add some PostgreSQL features that are not commonly used.
>
> Cheers,
>
> Xiao
>
>
> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz 
> wrote:
>
>> I think it is important to distinguish between two different concepts:
>>
>>- Adherence to standards and their well established implementations.
>>- Enabling migrations from some product X to Spark.
>>
>> While these two problems are related, they are independent and one can
>> be achieved without the other.
>>
>>    - The former approach doesn't imply that all features of the SQL
>>    standard (or its specific implementation) are provided. It is sufficient
>>    that the commonly used features that are implemented are standard
>>    compliant. Therefore, if an end user applies some well-known pattern,
>>    things will work as expected.
>>
>>In my personal opinion that's something that is worth the required
>>development resources, and in general should happen within the project.
>>
>>
>>    - The latter one is more complicated. First of all, the premise that
>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>>    While both Spark and PostgreSQL evolve, and probably have more in common
>>    today than a few years ago, they're not even close enough to pretend
>>    that one can be a replacement for the other. In contrast, existing
>>    compatibility layers between major vendors make sense, because feature
>>    disparity (at least when it comes to core functionality) is usually
>>    minimal. And that doesn't even touch the problem that PostgreSQL
>>    provides extensively used extension points that enable a broad and
>>    evolving ecosystem (what should we do about continuous queries? Should
>>    Structured Streaming provide some compatibility layer as well?).
>>
>>    More realistically, Spark could provide a compatibility layer with some
>>    analytical tools that themselves provide some PostgreSQL compatibility,
>>    but these are not always fully compatible with upstream PostgreSQL, nor
>>    do they necessarily follow the latest PostgreSQL development.
>>
>>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>>    availability of required primitives), maintained as a separate project,
>>    without putting more strain on existing resources. Effectively, what we
>>    care about here is whether we can translate a certain SQL string into a
>>    logical or physical plan.
>>
>>
>> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>>
>> Hi all,
>>
>> Recently we started an effort to achieve feature parity between Spark and
>> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>
>> This has gone very well. We've added many missing features (parser rules,
>> built-in functions, etc.) to Spark, and also corrected several
>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>> Many thanks to all the people who contributed to it!
>>
>> There are several cases when adding a PostgreSQL feature:
>> 1. Spark doesn't have this feature: just add it.
>> 2. Spark has this feature, but the behavior is different:
>> 2.1 Spark's behavior doesn't make sense: change it to follow SQL
>> standard and PostgreSQL, with a legacy config to restore the behavior.
>> 2.2 Spark's behavior makes sense but violates SQL standard: change
>> the behavior to follow SQL standard and PostgreSQL, when the ansi mode is
>> enabled (default false).
>> 2.3 Spark's behavior makes sense and doesn't violate SQL standard:
>> add the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
>> native dialect).
>>
>> The PostgreSQL dialect itself is a good idea. It can help users to
>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>> too. For example, DB2 provides an Oracle dialect.

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Xiao Li
+1


> One particular negative effect has been that new postgresql tests add well
> over an hour to tests,


Adding postgresql tests is for improving the test coverage of Spark SQL. We
should continue to do this by importing more test cases. The quality of
Spark highly depends on the test coverage. We can further parallelize the test
execution to reduce the test time.

Migrating PostgreSQL workloads to Spark SQL


This should not be our current focus. In the near future, it is impossible
to be fully compatible with PostgreSQL. We should focus on adding features
that are useful to the Spark community. PostgreSQL is a good reference, but we
do not need to blindly follow it. We already closed multiple related JIRAs
that try to add some PostgreSQL features that are not commonly used.

Cheers,

Xiao


On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz 
wrote:

> I think it is important to distinguish between two different concepts:
>
>- Adherence to standards and their well established implementations.
>- Enabling migrations from some product X to Spark.
>
> While these two problems are related, they are independent and one can be
> achieved without the other.
>
>    - The former approach doesn't imply that all features of the SQL
>    standard (or its specific implementation) are provided. It is sufficient
>    that the commonly used features that are implemented are standard
>    compliant. Therefore, if an end user applies some well-known pattern,
>    things will work as expected.
>
>In my personal opinion that's something that is worth the required
>development resources, and in general should happen within the project.
>
>
>    - The latter one is more complicated. First of all, the premise that
>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>    While both Spark and PostgreSQL evolve, and probably have more in common
>    today than a few years ago, they're not even close enough to pretend
>    that one can be a replacement for the other. In contrast, existing
>    compatibility layers between major vendors make sense, because feature
>    disparity (at least when it comes to core functionality) is usually
>    minimal. And that doesn't even touch the problem that PostgreSQL
>    provides extensively used extension points that enable a broad and
>    evolving ecosystem (what should we do about continuous queries? Should
>    Structured Streaming provide some compatibility layer as well?).
>
>    More realistically, Spark could provide a compatibility layer with some
>    analytical tools that themselves provide some PostgreSQL compatibility,
>    but these are not always fully compatible with upstream PostgreSQL, nor
>    do they necessarily follow the latest PostgreSQL development.
>
>    Furthermore, a compatibility layer can be, within certain limits (i.e.
>    availability of required primitives), maintained as a separate project,
>    without putting more strain on existing resources. Effectively, what we
>    care about here is whether we can translate a certain SQL string into a
>    logical or physical plan.
>
>
> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>
> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark and
> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This has gone very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
> Many thanks to all the people who contributed to it!
>
> There are several cases when adding a PostgreSQL feature:
> 1. Spark doesn't have this feature: just add it.
> 2. Spark has this feature, but the behavior is different:
> 2.1 Spark's behavior doesn't make sense: change it to follow SQL
> standard and PostgreSQL, with a legacy config to restore the behavior.
> 2.2 Spark's behavior makes sense but violates SQL standard: change the
> behavior to follow SQL standard and PostgreSQL, when the ansi mode is
> enabled (default false).
> 2.3 Spark's behavior makes sense and doesn't violate SQL standard:
> add the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
> native dialect).
>
> The PostgreSQL dialect itself is a good idea. It can help users to migrate
> PostgreSQL workloads to Spark. Other databases have this strategy too. For
> example, DB2 provides an Oracle dialect.
>
> However, there are so many differences between Spark and PostgreSQL,
> including SQL parsing, type coercion, function/operator behavior, data
> types, etc. I'm afraid that we may spend a lot of effort on it, and make
> the Spark codebase pretty complicated, but still not able to provide a
> usable PostgreSQL dialect.
>
> Furthermore, it's not clear to me how many users have the requirement of
> migrating PostgreSQL workloads. I think it's much more important to make
> Spark ANSI-compliant first, which doesn't need that much work.

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Maciej Szymkiewicz
I think it is important to distinguish between two different concepts:

  * Adherence to standards and their well established implementations.
  * Enabling migrations from some product X to Spark.

While these two problems are related, they are independent and one can
be achieved without the other.

  * The former approach doesn't imply that all features of the SQL standard
    (or its specific implementation) are provided. It is sufficient that the
    commonly used features that are implemented are standard compliant.
    Therefore, if an end user applies some well-known pattern, things will
    work as expected.

In my personal opinion that's something that is worth the required
development resources, and in general should happen within the project.

  * The latter one is more complicated. First of all, the premise that
    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
    While both Spark and PostgreSQL evolve, and probably have more in
    common today than a few years ago, they're not even close enough to
    pretend that one can be a replacement for the other. In contrast,
    existing compatibility layers between major vendors make sense,
    because feature disparity (at least when it comes to core
    functionality) is usually minimal. And that doesn't even touch the
    problem that PostgreSQL provides extensively used extension points
    that enable a broad and evolving ecosystem (what should we do about
    continuous queries? Should Structured Streaming provide some
    compatibility layer as well?).

    More realistically, Spark could provide a compatibility layer with
    some analytical tools that themselves provide some PostgreSQL
    compatibility, but these are not always fully compatible with
    upstream PostgreSQL, nor do they necessarily follow the latest
    PostgreSQL development.

    Furthermore, a compatibility layer can be, within certain limits
    (i.e. availability of required primitives), maintained as a separate
    project, without putting more strain on existing resources.
    Effectively, what we care about here is whether we can translate a
    certain SQL string into a logical or physical plan.
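
(A minimal sketch, through Spark's public API, of what "translate a certain
SQL string into a logical or physical plan" means in practice; it assumes
the spark-shell `spark` session, and the table name is hypothetical.)

  val df = spark.sql("SELECT id, count(*) AS cnt FROM events GROUP BY id")
  println(df.queryExecution.logical)        // parsed logical plan
  println(df.queryExecution.optimizedPlan)  // optimized logical plan
  println(df.queryExecution.executedPlan)   // selected physical plan
  df.explain(true)                          // parsed, analyzed, optimized and physical plans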


On 11/26/19 3:26 PM, Wenchen Fan wrote:
> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark
> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This has gone very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate behaviors of Spark to follow the SQL standard and
> PostgreSQL. Many thanks to all the people who contributed to it!
>
> There are several cases when adding a PostgreSQL feature:
> 1. Spark doesn't have this feature: just add it.
> 2. Spark has this feature, but the behavior is different:
>     2.1 Spark's behavior doesn't make sense: change it to follow SQL
> standard and PostgreSQL, with a legacy config to restore the behavior.
>     2.2 Spark's behavior makes sense but violates SQL standard: change
> the behavior to follow SQL standard and PostgreSQL, when the ansi mode
> is enabled (default false).
>     2.3 Spark's behavior makes sense and doesn't violate SQL standard:
> add the PostgreSQL behavior under the PostgreSQL dialect (default is
> Spark native dialect).
>
> The PostgreSQL dialect itself is a good idea. It can help users to
> migrate PostgreSQL workloads to Spark. Other databases have this
> strategy too. For example, DB2 provides an Oracle dialect.
>
> However, there are so many differences between Spark and PostgreSQL,
> including SQL parsing, type coercion, function/operator behavior, data
> types, etc. I'm afraid that we may spend a lot of effort on it, and
> make the Spark codebase pretty complicated, but still not able to
> provide a usable PostgreSQL dialect.
>
> Furthermore, it's not clear to me how many users have the requirement
> of migrating PostgreSQL workloads. I think it's much more important to
> make Spark ANSI-compliant first, which doesn't need that much work.
>
> Recently I've seen multiple PRs adding PostgreSQL cast functions,
> while our own cast function is not ANSI-compliant yet. This makes me
> think that we should do something to properly prioritize ANSI mode
> over other dialects.
>
> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3
> features under PostgreSQL dialect:
> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
> also allowed as true strings.
> 2. `date - date` returns interval in Spark (SQL standard behavior),
> but returns int in PostgreSQL
> 3. `int / int` returns double in Spark, but returns int in PostgreSQL.
> (there is no standard)
>
> We should still add PostgreSQL features that Spark doesn't have, or
> where Spark's behavior violates the SQL standard. But for others, let's just
> update the answer files of PostgreSQL tests.

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Sean Owen
Without knowing much about it, I have had the same question. How much of
this is important enough to justify the effort? One particular negative
effect has been that new postgresql tests add well over an hour to tests,
IIRC. So I tend to agree about drawing a reasonable line on compatibility
and maybe focusing elsewhere.

On Tue, Nov 26, 2019, 8:26 AM Wenchen Fan  wrote:

> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark and
> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This has gone very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
> Many thanks to all the people who contributed to it!
>
> There are several cases when adding a PostgreSQL feature:
> 1. Spark doesn't have this feature: just add it.
> 2. Spark has this feature, but the behavior is different:
> 2.1 Spark's behavior doesn't make sense: change it to follow SQL
> standard and PostgreSQL, with a legacy config to restore the behavior.
> 2.2 Spark's behavior makes sense but violates SQL standard: change the
> behavior to follow SQL standard and PostgreSQL, when the ansi mode is
> enabled (default false).
> 2.3 Spark's behavior makes sense and doesn't violate SQL standard:
> add the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
> native dialect).
>
> The PostgreSQL dialect itself is a good idea. It can help users to migrate
> PostgreSQL workloads to Spark. Other databases have this strategy too. For
> example, DB2 provides an Oracle dialect.
>
> However, there are so many differences between Spark and PostgreSQL,
> including SQL parsing, type coercion, function/operator behavior, data
> types, etc. I'm afraid that we may spend a lot of effort on it, and make
> the Spark codebase pretty complicated, but still not able to provide a
> usable PostgreSQL dialect.
>
> Furthermore, it's not clear to me how many users have the requirement of
> migrating PostgreSQL workloads. I think it's much more important to make
> Spark ANSI-compliant first, which doesn't need that much work.
>
> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
> our own cast function is not ANSI-compliant yet. This makes me think that
> we should do something to properly prioritize ANSI mode over other dialects.
>
> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Currently we only have 3 features
> under PostgreSQL dialect:
> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
> allowed as true strings.
> 2. `date - date` returns interval in Spark (SQL standard behavior), but
> returns int in PostgreSQL
> 3. `int / int` returns double in Spark, but returns int in PostgreSQL.
> (there is no standard)
>
> We should still add PostgreSQL features that Spark doesn't have, or
> where Spark's behavior violates the SQL standard. But for others, let's just update
> the answer files of PostgreSQL tests.
>
> Any comments are welcome!
>
> Thanks,
> Wenchen
>


[DISCUSS] PostgreSQL dialect

2019-11-26 Thread Wenchen Fan
Hi all,

Recently we started an effort to achieve feature parity between Spark and
PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764

This has gone very well. We've added many missing features (parser rules,
built-in functions, etc.) to Spark, and also corrected several
inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
Many thanks to all the people who contributed to it!

There are several cases when adding a PostgreSQL feature:
1. Spark doesn't have this feature: just add it.
2. Spark has this feature, but the behavior is different:
2.1 Spark's behavior doesn't make sense: change it to follow SQL
standard and PostgreSQL, with a legacy config to restore the behavior.
2.2 Spark's behavior makes sense but violates SQL standard: change the
behavior to follow SQL standard and PostgreSQL, when the ansi mode is
enabled (default false).
2.3 Spark's behavior makes sense and doesn't violate SQL standard: add
the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
native dialect).
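
(A hedged sketch of how cases 2.2 and 2.3 above surface to users as session
configs; `spark.sql.dialect` is the flag tried in #25697 elsewhere in this
thread, and the exact name of the ANSI flag in a given 3.0 preview build is
an assumption, so check SQLConf for the value your build actually uses.)

  // Case 2.2: opt in to the SQL-standard behavior (off by default).
  spark.conf.set("spark.sql.ansi.enabled", "true")   // assumed flag name
  // Case 2.3: opt in to the PostgreSQL behavior (default: Spark native dialect).
  spark.conf.set("spark.sql.dialect", "PostgreSQL")  // assumed accepted value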

The PostgreSQL dialect itself is a good idea. It can help users to migrate
PostgreSQL workloads to Spark. Other databases have this strategy too. For
example, DB2 provides an Oracle dialect.

However, there are so many differences between Spark and PostgreSQL,
including SQL parsing, type coercion, function/operator behavior, data
types, etc. I'm afraid that we may spend a lot of effort on it, and make
the Spark codebase pretty complicated, but still not able to provide a
usable PostgreSQL dialect.

Furthermore, it's not clear to me how many users have the requirement of
migrating PostgreSQL workloads. I think it's much more important to make
Spark ANSI-compliant first, which doesn't need that much work.

Recently I've seen multiple PRs adding PostgreSQL cast functions, while our
own cast function is not ANSI-compliant yet. This makes me think that we
should do something to properly prioritize ANSI mode over other dialects.

Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from
the codebase before it's too late. Currently we only have 3 features under
PostgreSQL dialect:
1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
allowed as true strings.
2. `date - date` returns interval in Spark (SQL standard behavior), but
returns int in PostgreSQL
3. `int / int` returns double in Spark, but returns int in PostgreSQL.
(there is no standard)
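
(An illustrative spark-shell sketch of the three behaviors above on a 3.0
preview build; the expected results in the comments only restate the list
above, and whether your build exposes the PostgreSQL side through the
`spark.sql.dialect` flag discussed elsewhere in this thread is an assumption.)

  // 1. Casting string to boolean: default Spark yields null for 'tru',
  //    while the PostgreSQL dialect accepts it as true.
  spark.sql("SELECT cast('tru' AS boolean)").show()
  // 2. date - date: interval in Spark (SQL standard behavior), int in PostgreSQL.
  spark.sql("SELECT date '2019-12-02' - date '2019-12-01'").show()
  // 3. int / int: double (1.5) in Spark, int (1) in PostgreSQL.
  spark.sql("SELECT 3 / 2").show()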

We should still add PostgreSQL features that Spark doesn't have, or where
Spark's behavior violates the SQL standard. But for others, let's just update the
answer files of PostgreSQL tests.

Any comments are welcome!

Thanks,
Wenchen