Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Ryan Blue
+1

On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh 
wrote:

> +0
>
> For reasons I outlined in the discussion thread
>
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 13 May 2024 at 08:24, Wenchen Fan  wrote:
>
>> +1
>>
>> On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:
>>
>>> +1
>>>
>>> On Mon, May 13, 2024 at 08:39, Dongjoon Hyun wrote:
>>> >
>>> > +1
>>> >
>>> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>>> >>>
>>> >>> +1
>>> >>>
>>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>>> >>> >
>>> >>> > +1
>>> >>> >
>>> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
>>> wrote:
>>> >>> >>
>>> >>> >> Hi all,
>>> >>> >>
>>> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
>>> Catalogs.
>>> >>> >>
>>> >>> >> Please also refer to:
>>> >>> >>
>>> >>> >>- Discussion thread:
>>> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>> >>> >>- JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-44167
>>> >>> >>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> >>> >>
>>> >>> >>
>>> >>> >> Please vote on the SPIP for the next 72 hours:
>>> >>> >>
>>> >>> >> [ ] +1: Accept the proposal as an official SPIP
>>> >>> >> [ ] +0
>>> >>> >> [ ] -1: I don’t think this is a good idea because …
>>> >>> >>
>>> >>> >>
>>> >>> >> Thank you!
>>> >>> >>
>>> >>> >> Liang-Chi Hsieh
>>> >>> >>
>>> >>> >>
>>> -
>>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>> >>
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Ryan Blue
Tabular


Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Ryan Blue
>>>- created_by: parquet-mr version 1.12.3: While this doesn't directly
>>>specify the format version, it is accepted that older versions of
>>>parquet-mr like 1.12.3 typically write Parquet version 1 files.
>>>
>>> Since in this case Spark 3.4 is capable of reading both versions (1 and
>>> 2), you don't necessarily need to modify your Spark code to access this
>>> file. However, if you want to create Parquet files in version 2 using
>>> Spark, you might need to consider additional changes like excluding
>>> parquet-mr or upgrading the Parquet libraries and doing a custom build of
>>> Spark. However, given the law of diminishing returns, I would not advise
>>> that either. You can of course use gzip for compression if that is more
>>> suitable for your needs.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Tue, 16 Apr 2024 at 15:00, Prem Sahoo  wrote:
>>>
>>>> Hello Community,
>>>> Could any of you shed some light on below questions please ?
>>>> Sent from my iPhone
>>>>
>>>> On Apr 15, 2024, at 9:02 PM, Prem Sahoo  wrote:
>>>>
>>>> 
>>>> Any specific reason Spark does not support, or the community doesn't want
>>>> to go to, Parquet V2, which is more optimized and where read and write are
>>>> much faster (from another component which I am using)?
>>>>
>>>> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue  wrote:
>>>>
>>>>> Spark will read data written with v2 encodings just fine. You just
>>>>> don't need to worry about making Spark produce v2. And you should probably
>>>>> also not produce v2 encodings from other systems.
>>>>>
>>>>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo 
>>>>> wrote:
>>>>>
>>>>>> Oops, so Spark does not support Parquet V2 at the moment? We have a
>>>>>> use case where we need Parquet V2, as one of our components uses
>>>>>> Parquet V2.
>>>>>>
>>>>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>>>>>>
>>>>>>> Hi Prem,
>>>>>>>
>>>>>>> Parquet v1 is the default because v2 has not been finalized and
>>>>>>> adopted by the community. I highly recommend not using v2 encodings at 
>>>>>>> this
>>>>>>> time.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr
>>>>>>>> 1.2.1 which writes in Parquet version 1, not version 2 :(. So I was
>>>>>>>> looking at how to write in Parquet version 2?
>>>>>>>>
>>>>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sorry you have a point there. It was released in version 3.00.
>>>>>>>>> What version of spark are you using?
>>>>>>>>>
>>>>>>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>>>>>>> London
>>>>>>>>> United Kingdom
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Spark will read data written with v2 encodings just fine. You just don't
need to worry about making Spark produce v2. And you should probably also
not produce v2 encodings from other systems.
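
For reference, the format version is controlled by parquet-mr's writer
property rather than by Spark itself. A minimal sketch in Scala, assuming
the parquet-mr "parquet.writer.version" property and an illustrative
output path (again, not something I'd recommend for production data):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-v2-sketch").getOrCreate()

    // parquet-mr picks this up from the Hadoop configuration used for the
    // write; "v2" asks for the v2 data pages/encodings.
    spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")

    // Illustrative data and path only.
    spark.range(1000).toDF("id")
      .write.mode("overwrite")
      .parquet("/tmp/parquet_v2_sample")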

On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo  wrote:

> Oops, so Spark does not support Parquet V2 at the moment? We have a use
> case where we need Parquet V2, as one of our components uses Parquet V2.
>
> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue  wrote:
>
>> Hi Prem,
>>
>> Parquet v1 is the default because v2 has not been finalized and adopted
>> by the community. I highly recommend not using v2 encodings at this time.
>>
>> Ryan
>>
>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo  wrote:
>>
>>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr
>>> 1.2.1 which writes in Parquet version 1, not version 2 :(. So I was
>>> looking at how to write in Parquet version 2?
>>>
>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Sorry you have a point there. It was released in version 3.00. What
>>>> version of spark are you using?
>>>>
>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>> expert opinions (Werner
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>
>>>>
>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo  wrote:
>>>>
>>>>> Thank you so much for the info! But do we have any release notes where
>>>>> it says Spark 2.4.0 onwards supports Parquet version 2? I was under the
>>>>> impression it was only supported from Spark 3.0 onwards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Well if I am correct, Parquet version 2 support was introduced in
>>>>>> Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0
>>>>>> supports Parquet version 2. Assuming that you are using Spark version
>>>>>> 2.4.0 or later, you should be able to take advantage of Parquet version 2
>>>>>> features.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>>> expert opinions (Werner
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>
>>>>>>
>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo 
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you for the information!
>>>>>>> I can use any version of parquet-mr to produce parquet file.
>>>>>>>
>>>>>>> regarding 2nd question .
>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>> May I get the release notes where parquet versions

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Hi Prem,

Parquet v1 is the default because v2 has not been finalized and adopted by
the community. I highly recommend not using v2 encodings at this time.

Ryan

On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo  wrote:

> I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1
> which writes in Parquet version 1, not version 2 :(. So I was looking at
> how to write in Parquet version 2?
>
> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh 
> wrote:
>
>> Sorry you have a point there. It was released in version 3.00. What
>> version of spark are you using?
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo  wrote:
>>
>>> Thank you so much for the info! But do we have any release notes where
>>> it says Spark 2.4.0 onwards supports Parquet version 2? I was under the
>>> impression it was only supported from Spark 3.0 onwards.
>>>
>>>
>>>
>>>
>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well if I am correct, Parquet version 2 support was introduced in Spark
>>>> version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports
>>>> Parquet version 2. Assuming that you are using Spark version  2.4.0 or
>>>> later, you should be able to take advantage of Parquet version 2 features.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>> expert opinions (Werner
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>
>>>>
>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo  wrote:
>>>>
>>>>> Thank you for the information!
>>>>> I can use any version of parquet-mr to produce parquet file.
>>>>>
>>>>> regarding 2nd question .
>>>>> Which version of spark is supporting parquet version 2?
>>>>> May I get the release notes where parquet versions are mentioned ?
>>>>>
>>>>>
>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Parquet-mr is a Java library that provides functionality for working
>>>>>> with Parquet files with hadoop. It is therefore  more geared towards
>>>>>> working with Parquet files within the Hadoop ecosystem, particularly 
>>>>>> using
>>>>>> MapReduce jobs. There is no definitive way to check exact compatible
>>>>>> versions within the library itself. However, you can have a look at this
>>>>>>
>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>>> expert opinions (Werner
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>
>>>>>>
>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Team,
>>>>>>> May I know how to check which version of parquet is supported by
>>>>>>> parquet-mr 1.2.1 ?
>>>>>>>
>>>>>>> Which version of parquet-mr is supporting parquet version 2 (V2) ?
>>>>>>>
>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>> May I get the release notes where parquet versions are mentioned ?
>>>>>>>
>>>>>>

-- 
Ryan Blue
Tabular


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Ryan Blue
+1

On Thu, Nov 9, 2023 at 4:23 PM Hussein Awala  wrote:

> +1 for creating an official Kubernetes operator for Apache Spark
>
> On Fri, Nov 10, 2023 at 12:38 AM huaxin gao 
> wrote:
>
>> +1
>>
>> On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:
>>
>>> +1
>>>
>>> To be completely transparent, I am employed in the same department as
>>> Zhou at Apple.
>>>
>>> I support this proposal, provided that we witness community adoption
>>> following the release of the Flink Kubernetes operator, streamlining Flink
>>> deployment on Kubernetes.
>>>
>>> A well-maintained official Spark Kubernetes operator is essential for
>>> our Spark community as well.
>>>
>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>
>>> On Nov 9, 2023, at 12:05 PM, Zhou Jiang  wrote:
>>>
>>> Hi Spark community,
>>> I'm reaching out to initiate a conversation about the possibility of
>>> developing a Java-based Kubernetes operator for Apache Spark. Following the
>>> operator pattern (
>>> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark
>>> users may manage applications and related components seamlessly using
>>> native tools like kubectl. The primary goal is to simplify the Spark user
>>> experience on Kubernetes, minimizing the learning curve and operational
>>> complexities and therefore enable users to focus on the Spark application
>>> development.
>>> Although there are several open-source Spark on Kubernetes operators
>>> available, none of them are officially integrated into the Apache Spark
>>> project. As a result, these operators may lack active support and
>>> development for new features. Within this proposal, our aim is to introduce
>>> a Java-based Spark operator as an integral component of the Apache Spark
>>> project. This solution has been employed internally at Apple for multiple
>>> years, operating millions of executors in real production environments. The
>>> use of Java in this solution is intended to accommodate a wider user and
>>> contributor audience, especially those who are not familiar with Scala.
>>> Ideally, this operator should have its dedicated repository, similar to
>>> Spark Connect Golang or Spark Docker, allowing it to maintain a loose
>>> connection with the Spark release cycle. This model is also followed by the
>>> Apache Flink Kubernetes operator.
>>> We believe that this project holds the potential to evolve into a
>>> thriving community project over the long run. A comparison can be drawn
>>> with the Flink Kubernetes Operator: Apple has open-sourced internal Flink
>>> Kubernetes operator, making it a part of the Apache Flink project (
>>> https://github.com/apache/flink-kubernetes-operator). This move has
>>> gained wide industry adoption and contributions from the community. In a
>>> mere year, the Flink operator has garnered more than 600 stars and has
>>> attracted contributions from over 80 contributors. This showcases the level
>>> of community interest and collaborative momentum that can be achieved in
>>> similar scenarios.
>>> More details can be found at SPIP doc : Spark Kubernetes Operator
>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>>
>>> Thanks,
>>> --
>>> *Zhou JIANG*
>>>
>>>
>>>

-- 
Ryan Blue
Tabular


Re: Query hints visible to DSV2 connectors?

2023-08-03 Thread Ryan Blue
You probably want to use data source options. Those get passed through but
can't be set in SQL.
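
As a rough sketch (the connector name, option key, and values below are
made up for illustration), the option set on the read is handed to the
connector's ScanBuilder through the options map:

    // Query side: a per-query option set through the DataFrame reader.
    val df = spark.read
      .format("my-connector")
      .option("reader-mode", "columnar")
      .load()

    // Connector side: the same map arrives in newScanBuilder. Something like:
    //
    //   import org.apache.spark.sql.connector.read.ScanBuilder
    //   import org.apache.spark.sql.util.CaseInsensitiveStringMap
    //
    //   override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = {
    //     val columnar = options.getOrDefault("reader-mode", "row") == "columnar"
    //     if (columnar) new MyColumnarScanBuilder() else new MyRowScanBuilder()
    //   }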

On Wed, Aug 2, 2023 at 5:39 PM Alex Cruise  wrote:

> Hey folks,
>
> I'm adding an optional feature to my DSV2 connector where it can choose
> between a row-based or columnar PartitionReader dynamically depending on a
> query's schema. I'd like to be able to supply a hint at query time that's
> visible to the connector, but at the moment I can't see any way to
> accomplish that.
>
> From what I can see the artifacts produced by the existing hint system [
> https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html
> or sql("select 1").hint("foo").show()] aren't visible from the
> TableCatalog/Table/ScanBuilder.
>
> I guess I could set a config parameter but I'd rather do this on a
> per-query basis. Any tips?
>
> Thanks!
>
> -0xe1a
>


-- 
Ryan Blue
Tabular


Re: Data Contracts

2023-06-12 Thread Ryan Blue
Hey Phillip,

You're right that we can improve tooling to help with data contracts, but I
think that a contract still needs to be an agreement between people.
Constraints help by helping to ensure a data producer adheres to the
contract and gives feedback as soon as possible when assumptions are
violated. The problem with treating those constraints as the only contract
is that it's too easy to change them. For example, if I change a required
column to a nullable column, that's a perfectly valid transition, but only
if I've communicated that change to downstream consumers.
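
To make that concrete, the producer-side change is a one-line DDL statement
(table and column names are illustrative, for a v2 table whose catalog
supports it):

    // Dropping the NOT NULL constraint is trivial for the producer...
    spark.sql("ALTER TABLE db.events ALTER COLUMN user_id DROP NOT NULL")
    // ...but every consumer that assumed user_id is always present is now at risk.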

Ryan

On Mon, Jun 12, 2023 at 4:43 AM Phillip Henry 
wrote:

> Hi, folks.
>
> There currently seems to be a buzz around "data contracts". From what I
> can tell, these mainly advocate a cultural solution. But instead, could big
> data tools be used to enforce these contracts?
>
> My questions really are: are there any plans to implement data constraints
> in Spark (eg, an integer must be between 0 and 100; the date in column X
> must be before that in column Y)? And if not, is there an appetite for them?
>
> Maybe we could associate constraints with schema metadata that are
> enforced in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an
> area that interests me as it seems that over half my problems at the day
> job are because of dodgy data.
>
> Regards,
>
> Phillip
>
>

-- 
Ryan Blue
Tabular


Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Ryan Blue
+1 for the SPIP. I think it's well designed and it has worked quite well at
Netflix for a long time.

On Thu, Feb 3, 2022 at 2:04 PM John Zhuge  wrote:

> Hi Spark community,
>
> I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP until Feb. 9th (Wednesday).
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>


-- 
Ryan Blue
Tabular


Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-12 Thread Ryan Blue
+1 (non-binding)

On Wed, Jan 12, 2022 at 10:29 AM Mridul Muralidharan 
wrote:

>
> +1 (binding)
> This should be a great improvement !
>
> Regards,
> Mridul
>
> On Wed, Jan 12, 2022 at 4:04 AM Kent Yao  wrote:
>
>> +1 (non-binding)
>>
>>> On Wed, Jan 12, 2022 at 11:52, Thomas Graves wrote:
>>
>>> +1 (binding).
>>>
>>> One minor note since I haven't had time to look at the implementation
>>> details is please make sure resource aware scheduling and the stage
>>> level scheduling still work or any caveats are documented. Feel free
>>> to ping me if questions in these areas.
>>>
>>> Tom
>>>
>>> On Wed, Jan 5, 2022 at 7:07 PM Yikun Jiang  wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I’d like to start a vote for SPIP: "Support Customized Kubernetes
>>> Schedulers Proposal"
>>> >
>>> > The SPIP is to support customized Kubernetes schedulers in Spark on
>>> Kubernetes.
>>> >
>>> > Please also refer to:
>>> >
>>> > - Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support
>>> Volcano/Alternative Schedulers Proposal
>>> > - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes
>>> Schedulers Proposal
>>> > - JIRA: SPARK-36057
>>> >
>>> > Please vote on the SPIP:
>>> >
>>> > [ ] +1: Accept the proposal as an official SPIP
>>> > [ ] +0
>>> > [ ] -1: I don’t think this is a good idea because …
>>> >
>>> > Regards,
>>> > Yikun
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Ryan Blue
Tabular


Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Ryan Blue
Mich, time travel will use the newly added VERSION AS OF or TIMESTAMP AS OF
syntax.
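
A sketch of the two forms, using the snapshot id from the earlier example
(the table name is illustrative, and the syntax comes from the PR linked
below, so treat it as proposed until it lands):

    // Read the table as of a specific version/snapshot id.
    spark.sql("SELECT * FROM t VERSION AS OF 10963874102873")

    // Read the table as of a point in time.
    spark.sql("SELECT * FROM t TIMESTAMP AS OF '2021-11-01 00:00:00'")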

On Tue, Nov 16, 2021 at 12:40 AM Mich Talebzadeh 
wrote:

> As I stated before, hints are designed to direct the optimizer to choose
> a certain query execution plan based on the specific criteria.
>
>
> -- time travel
> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>
>
> The alternative would be to specify time travel by creating a snapshot
> based on CURRENT_DATE() range which encapsulates time travel for
> 'snapshot-id'='10963874102873L'
>
>
> CREATE SNAPSHOT t_snap
>
>   START WITH CURRENT_DATE() - 30
>
>   NEXT CURRENT_DATE()
>
>   AS SELECT * FROM t
>
>
> SELECT * FROM t_snap
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 16 Nov 2021 at 04:26, Hyukjin Kwon  wrote:
>
>> My biggest concern with the syntax in hints is that Spark SQL's options
>> can change results (e.g., CSV's header options) whereas hints are generally
>> not designed to affect the external results if I am not mistaken. This is
>> counterintuitive.
>> I left the comment in the PR but what's the real benefit over leveraging:
>> SET conf and RESET conf? we can extract options from runtime session
>> configurations e.g., SessionConfigSupport.
>>
>> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Side note about time travel: There is a PR
>>> <https://github.com/apache/spark/pull/34497> to add VERSION/TIMESTAMP
>>> AS OF syntax to Spark SQL.
>>>
>>> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>>>
>>>> I want to note that I wouldn't recommend time traveling this way by
>>>> using the hint for `snapshot-id`. Instead, we want to add the standard SQL
>>>> syntax for that in a separate PR. This is useful for other options that
>>>> help a table scan perform better, like specifying the target split size.
>>>>
>>>> You're right that this isn't a typical optimizer hint, but I'm not sure
>>>> what other syntax is possible for this use case. How else would we send
>>>> custom properties through to the scan?
>>>>
>>>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I am looking at the hint and it appears to me (I stand corrected), it
>>>>> is a single table hint as below:
>>>>>
>>>>> -- time travel
>>>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>>>
>>>>> My assumption is that any view on this table will also benefit from
>>>>> this hint. This is not a hint to optimizer in a classical sense. Only a
>>>>> snapshot hint. Normally, a hint is an instruction to the optimizer.
>>>>> When writing SQL, one may know information about the data unknown to the
>>>>> optimizer. Hints enable one to make decisions normally made by the
>>>>> optimizer, sometimes causing the optimizer to select a plan that it sees 
>>>>> as
>>>>> higher cost.
>>>>>
>>>>>
>>>>> So far as this case is concerned, it looks OK and I concur it should
>>>>> be extended to write as well.
>>>>>
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer <

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
I want to note that I wouldn't recommend time traveling this way by using
the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
for that in a separate PR. This is useful for other options that help a
table scan perform better, like specifying the target split size.

You're right that this isn't a typical optimizer hint, but I'm not sure
what other syntax is possible for this use case. How else would we send
custom properties through to the scan?

On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh 
wrote:

> I am looking at the hint and it appears to me (I stand corrected), it is a
> single table hint as below:
>
> -- time travel
> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>
> My assumption is that any view on this table will also benefit from this
> hint. This is not a hint to optimizer in a classical sense. Only a snapshot
> hint. Normally, a hint is an instruction to the optimizer. When writing
> SQL, one may know information about the data unknown to the optimizer.
> Hints enable one to make decisions normally made by the optimizer,
> sometimes causing the optimizer to select a plan that it sees as higher
> cost.
>
>
> So far as this case is concerned, it looks OK and I concur it should be
> extended to write as well.
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer 
> wrote:
>
>> I think since we probably will end up using this same syntax on write,
>> this makes a lot of sense. Unless there is another good way to express a
>> similar concept during a write operation I think going forward with this
>> would be ok.
>>
>> On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:
>>
>>> The proposed feature is to be able to pass options through SQL like you
>>> would when using the DataFrameReader API, so it would work for all
>>> sources that support read options. Read options are part of the DSv2 API,
>>> there just isn’t a way to pass options when using SQL. The PR also has a
>>> non-Iceberg example, which is being able to customize some JDBC source
>>> behaviors per query (e.g., fetchSize), rather than globally in the table’s
>>> options.
>>>
>>> The proposed syntax is odd, but I think that's an artifact of Spark
>>> introducing read options that aren't a normal part of SQL. Seems reasonable
>>> to me to pass them through a hint.
>>>
>>> On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Interesting.
>>>>
>>>> What is this going to add on top of support for Apache Iceberg
>>>> <https://www.dremio.com/data-lake/apache-iceberg/>. Will it be in line
>>>> with support for Hive ACID tables or Delta Lake?
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 15 Nov 2021 at 01:56, Zhun Wang  wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> We are discussing Support Dynamic Table Options for Spark SQL (
>>>>> https://github.com/apache/spark/pull/34072). It is currently not sure
>>>>> if the syntax makes sense, and would like to know if there is other
>>>>> feedback or opinion on this.
>>>>>
>>>>> I would appreciate any feedback on this.
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular


Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
The proposed feature is to be able to pass options through SQL like you
would when using the DataFrameReader API, so it would work for all sources
that support read options. Read options are part of the DSv2 API, there
just isn’t a way to pass options when using SQL. The PR also has a
non-Iceberg example, which is being able to customize some JDBC source
behaviors per query (e.g., fetchSize), rather than globally in the table’s
options.

The proposed syntax is odd, but I think that's an artifact of Spark
introducing read options that aren't a normal part of SQL. Seems reasonable
to me to pass them through a hint.
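
To make the gap concrete: today a per-query JDBC fetch size can only be set
through the DataFrame API, while the PR would allow the same options inline
in SQL (connection details below are illustrative, and the hint form is the
proposed syntax, not something that exists yet):

    // Today: per-query read options via the DataFrame API.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "public.events")
      .option("fetchsize", "10000")
      .load()

    // Proposed: the same option passed through SQL.
    // spark.sql("SELECT * FROM jdbc_table /*+ OPTIONS('fetchsize'='10000') */")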

On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh 
wrote:

> Interesting.
>
> What is this going to add on top of support for Apache Iceberg
> <https://www.dremio.com/data-lake/apache-iceberg/>. Will it be in line
> with support for Hive ACID tables or Delta Lake?
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 15 Nov 2021 at 01:56, Zhun Wang  wrote:
>
>> Hi dev,
>>
>> We are discussing Support Dynamic Table Options for Spark SQL (
>> https://github.com/apache/spark/pull/34072). It is currently not sure if
>> the syntax makes sense, and would like to know if there is other feedback
>> or opinion on this.
>>
>> I would appreciate any feedback on this.
>>
>> Thanks.
>>
>

-- 
Ryan Blue
Tabular


Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-14 Thread Ryan Blue
+1

Thanks to Anton for all this great work!

On Sat, Nov 13, 2021 at 8:24 AM Mich Talebzadeh 
wrote:

> +1 non-binding
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 13 Nov 2021 at 15:07, Russell Spitzer 
> wrote:
>
>> +1 (never binding)
>>
>> On Sat, Nov 13, 2021 at 1:10 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Nov 12, 2021 at 6:58 PM huaxin gao 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Fri, Nov 12, 2021 at 6:44 PM Yufei Gu 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> > On Nov 12, 2021, at 6:25 PM, L. C. Hsieh  wrote:
>>>>> >
>>>>> > Hi all,
>>>>> >
>>>>> > I’d like to start a vote for SPIP: Row-level operations in Data
>>>>> Source V2.
>>>>> >
>>>>> > The proposal is to add support for executing row-level operations
>>>>> > such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
>>>>> > execution should be the same across data sources and the best way to
>>>>> do
>>>>> > that is to implement it in Spark.
>>>>> >
>>>>> > Right now, Spark can only parse and to some extent analyze DELETE,
>>>>> UPDATE,
>>>>> > MERGE commands. Data sources that support row-level changes have to
>>>>> build
>>>>> > custom Spark extensions to execute such statements. The goal of this
>>>>> effort
>>>>> > is to come up with a flexible and easy-to-use API that will work
>>>>> across
>>>>> > data sources.
>>>>> >
>>>>> > Please also refer to:
>>>>> >
>>>>> >   - Previous discussion in dev mailing list: [DISCUSS] SPIP:
>>>>> > Row-level operations in Data Source V2
>>>>> >   <https://lists.apache.org/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
>>>>> >
>>>>> >   - JIRA: SPARK-35801 <
>>>>> https://issues.apache.org/jira/browse/SPARK-35801>
>>>>> >   - PR for handling DELETE statements:
>>>>> > <https://github.com/apache/spark/pull/33008>
>>>>> >
>>>>> >   - Design doc
>>>>> > <
>>>>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>>>>> >
>>>>> >
>>>>> > Please vote on the SPIP for the next 72 hours:
>>>>> >
>>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>>> > [ ] +0
>>>>> > [ ] -1: I don’t think this is a good idea because …
>>>>> >
>>>>> > -
>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >
>>>>>
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>

-- 
Ryan Blue
Tabular


Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread Ryan Blue
+1

On Fri, Oct 29, 2021 at 11:06 AM huaxin gao  wrote:

> +1
>
> On Fri, Oct 29, 2021 at 10:59 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Dongjoon
>>
>> On 2021/10/29 17:48:59, Russell Spitzer 
>> wrote:
>> > +1 This is a great idea, (I have no Apache Spark voting points)
>> >
>> > On Fri, Oct 29, 2021 at 12:41 PM L. C. Hsieh  wrote:
>> >
>> > >
>> > > I'll start with my +1.
>> > >
>> > > On 2021/10/29 17:30:03, L. C. Hsieh  wrote:
>> > > > Hi all,
>> > > >
>> > > > I’d like to start a vote for SPIP: Storage Partitioned Join for Data
>> > > Source V2.
>> > > >
>> > > > The proposal is to support a new type of join: storage partitioned
>> join
>> > > which
>> > > > covers bucket join support for DataSourceV2 but is more general.
>> The goal
>> > > > is to let Spark leverage distribution properties reported by data
>> > > sources and
>> > > > eliminate shuffle whenever possible.
>> > > >
>> > > > Please also refer to:
>> > > >
>> > > >- Previous discussion in dev mailing list: [DISCUSS] SPIP:
>> Storage
>> > > Partitioned Join for Data Source V2
>> > > ><
>> > >
>> https://lists.apache.org/thread.html/r7dc67c3db280a8b2e65855cb0b1c86b524d4e6ae1ed9db9ca12cb2e6%40%3Cdev.spark.apache.org%3E
>> > > >
>> > > >.
>> > > >- JIRA: SPARK-37166 <
>> > > https://issues.apache.org/jira/browse/SPARK-37166>
>> > > >- Design doc <
>> > >
>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>> >
>> > >
>> > > >
>> > > > Please vote on the SPIP for the next 72 hours:
>> > > >
>> > > > [ ] +1: Accept the proposal as an official SPIP
>> > > > [ ] +0
>> > > > [ ] -1: I don’t think this is a good idea because …
>> > > >
>> > > >
>> -
>> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > > >
>> > > >
>> > >
>> > > -
>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > >
>> > >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Ryan Blue
Tabular


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
The transform expressions in v2 are logical, not concrete implementations.
Even days may have different implementations -- the only expectation is
that the partitions are day-sized. For example, you could use a transform
that splits days at UTC 00:00, or uses some other day boundary.

Because the expressions are logical, we need to resolve them to
implementations at some point, like Chao outlines. We can do that using a
FunctionCatalog, although I think it's worth considering adding an
interface so that a transform from a Table can be converted into a
`BoundFunction` directly. That is easier than defining a way for Spark to
query the function catalog.

In any case, I'm sure it's easy to understand how this works once you get a
concrete implementation.
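
To sketch what I mean (this is illustrative, not Iceberg's actual bucket
implementation, and the canonical name format is made up):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
    import org.apache.spark.sql.types.{DataType, IntegerType}

    // A bound bucket function for int keys. Two sources that implement the
    // same hashing can report the same canonicalName, which is how Spark can
    // tell that their partitioning is compatible.
    class BucketInt(numBuckets: Int) extends ScalarFunction[Integer] {
      override def inputTypes(): Array[DataType] = Array(IntegerType)
      override def resultType(): DataType = IntegerType
      override def name(): String = "bucket"
      override def canonicalName(): String = s"com.example.bucket.int[$numBuckets]"

      // Illustrative hash only; a real source would document its exact function.
      override def produceResult(input: InternalRow): Integer =
        Integer.valueOf(Math.floorMod(input.getInt(0), numBuckets))
    }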

On Wed, Oct 27, 2021 at 9:35 AM Wenchen Fan  wrote:

> `BucketTransform` is a builtin partition transform in Spark, instead of a
> UDF from `FunctionCatalog`. Will Iceberg use UDF from `FunctionCatalog` to
> represent its bucket transform, or use the Spark builtin `BucketTransform`?
> I'm asking this because other v2 sources may also use the builtin
> `BucketTransform` but use a different bucket hash function. Or we can
> clearly define the bucket hash function of the builtin `BucketTransform` in
> the doc.
>
> On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue  wrote:
>
>> Two v2 sources may return different bucket IDs for the same value, and
>> this breaks the phase 1 split-wise join.
>>
>> This is why the FunctionCatalog included a canonicalName method (docs
>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java#L81-L96>).
>> That method returns an identifier that can be used to compare whether two
>> bucket function instances are the same.
>>
>>
>>1. Can we apply this idea to partitioned file source tables
>>(non-bucketed) as well?
>>
>> What do you mean here? The design doc discusses transforms like days(ts)
>> that can be supported in the future. Is that what you’re asking about? Or
>> are you referring to v1 file sources? I think the goal is to support v2,
>> since v1 doesn’t have reliable behavior.
>>
>> Note that the initial implementation goal is to support bucketing since
>> that’s an easier case because both sides have the same number of
>> partitions. More complex storage-partitioned joins can be implemented later.
>>
>>
>>1. What if the table has many partitions? Shall we apply certain join
>>algorithms in the phase 1 split-wise join as well? Or even launch a Spark
>>job to do so?
>>
>> I think that this proposal opens up a lot of possibilities, like what
>> you’re suggesting here. It is a bit like AQE. We’ll need to come up with
>> heuristics for choosing how and when to use storage partitioning in joins.
>> As I said above, bucketing is a great way to get started because it fills
>> an existing gap. More complex use cases can be supported over time.
>>
>> Ryan
>>
>> On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan  wrote:
>>
>>> IIUC, the general idea is to let each input split report its partition
>>> value, and Spark can perform the join in two phases:
>>> 1. join the input splits from left and right tables according to their
>>> partitions values and join keys, at the driver side.
>>> 2. for each joined input splits pair (or a group of splits), launch a
>>> Spark task to join the rows.
>>>
>>> My major concern is about how to define "compatible partitions". Things
>>> like `days(ts)` are straightforward: the same timestamp value always
>>> results in the same partition value, in whatever v2 sources. `bucket(col,
>>> num)` is tricky, as Spark doesn't define the bucket hash function. Two v2
>>> sources may return different bucket IDs for the same value, and this breaks
>>> the phase 1 split-wise join.
>>>
>>> And two questions for further improvements:
>>> 1. Can we apply this idea to partitioned file source tables
>>> (non-bucketed) as well?
>>> 2. What if the table has many partitions? Shall we apply certain join
>>> algorithms in the phase 1 split-wise join as well? Or even launch a Spark
>>> job to do so?
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Wed, Oct 27, 2021 at 3:08 AM Chao Sun  wrote:
>>>
>>>> Thanks Cheng for the comments.
>>>>
>>>> > Is migrating Hive table read path to data source v2, being a
>>>> prerequisite of this SPIP
>>>>
>>>> Yes, this SPIP only aims at DataSource

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
>>> ORC file format only), and legacy Hive code path (HiveTableScanExec). In
>>> the SPIP, I am seeing we only make change for data source v2, so wondering
>>> how this would work with existing Hive table read path. In addition, just
>>> FYI, supporting writing Hive bucketed table is merged in master recently (
>>> SPARK-19256 <https://issues.apache.org/jira/browse/SPARK-19256> has
>>> details).
>>>
>>>
>>>
>>>1. Would aggregate work automatically after the SPIP?
>>>
>>>
>>>
>>> Another major benefit for having bucketed table, is to avoid shuffle
>>> before aggregate. Just want to bring to our attention that it would be
>>> great to consider aggregate as well when doing this proposal.
>>>
>>>
>>>
>>>1. Any major use cases in mind except Hive bucketed table?
>>>
>>>
>>>
>>> Just curious if there’s any other use cases we are targeting as part of
>>> SPIP.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Cheng Su
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From: *Ryan Blue 
>>> *Date: *Tuesday, October 26, 2021 at 9:39 AM
>>> *To: *John Zhuge 
>>> *Cc: *Chao Sun , Wenchen Fan ,
>>> Cheng Su , DB Tsai , Dongjoon Hyun <
>>> dongjoon.h...@gmail.com>, Hyukjin Kwon , Wenchen
>>> Fan , angers zhu , dev <
>>> dev@spark.apache.org>, huaxin gao 
>>> *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source
>>> V2
>>>
>>> Instead of commenting on the doc, could we keep discussion here on the
>>> dev list please? That way more people can follow it and there is more room
>>> for discussion. Comment threads have a very small area and easily become
>>> hard to follow.
>>>
>>>
>>>
>>> Ryan
>>>
>>>
>>>
>>> On Tue, Oct 26, 2021 at 9:32 AM John Zhuge  wrote:
>>>
>>> +1  Nicely done!
>>>
>>>
>>>
>>> On Tue, Oct 26, 2021 at 8:08 AM Chao Sun  wrote:
>>>
>>> Oops, sorry. I just fixed the permission setting.
>>>
>>>
>>>
>>> Thanks everyone for the positive support!
>>>
>>>
>>>
>>> On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan  wrote:
>>>
>>> +1 to this SPIP and nice writeup of the design doc!
>>>
>>>
>>>
>>> Can we open comment permission in the doc so that we can discuss details
>>> there?
>>>
>>>
>>>
>>> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon 
>>> wrote:
>>>
>>> Seems making sense to me.
>>>
>>> Would be great to have some feedback from people such as @Wenchen Fan
>>>  @Cheng Su  @angers zhu
>>> .
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
>>> wrote:
>>>
>>> +1 for this SPIP.
>>>
>>>
>>>
>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao 
>>> wrote:
>>>
>>> +1. Thanks for lifting the current restrictions on bucket join and
>>> making this more generalized.
>>>
>>>
>>>
>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>>>
>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>> point!
>>>
>>>
>>>
>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>>>
>>> +1 on this SPIP.
>>>
>>> This is a more generalized version of bucketed tables and bucketed
>>> joins which can eliminate very expensive data shuffles when joins, and
>>> many users in the Apache Spark community have wanted this feature for
>>> a long time!
>>>
>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>> it as a new feature in Spark 3.3
>>>
>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>
>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
>>> >
>>> > Hi,
>>> >
>>> > Ryan and I drafted a design doc to support a new type of join: storage
>>> partitioned join which covers bucket join support for DataSourceV2 but is
>>> more general. The goal is to let Spark leverage distribution properties
>>> reported by data sources and eliminate shuffle whenever possible.
>>> >
>>> > Design doc:
>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>> (includes a POC link at the end)
>>> >
>>> > We'd like to start a discussion on the doc and any feedback is welcome!
>>> >
>>> > Thanks,
>>> > Chao
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>>
>>>
>>>
>>> --
>>>
>>> John Zhuge
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>

-- 
Ryan Blue


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Ryan Blue
Instead of commenting on the doc, could we keep discussion here on the dev
list please? That way more people can follow it and there is more room for
discussion. Comment threads have a very small area and easily become hard
to follow.

Ryan

On Tue, Oct 26, 2021 at 9:32 AM John Zhuge  wrote:

> +1  Nicely done!
>
> On Tue, Oct 26, 2021 at 8:08 AM Chao Sun  wrote:
>
>> Oops, sorry. I just fixed the permission setting.
>>
>> Thanks everyone for the positive support!
>>
>> On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan  wrote:
>>
>>> +1 to this SPIP and nice writeup of the design doc!
>>>
>>> Can we open comment permission in the doc so that we can discuss details
>>> there?
>>>
>>> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Seems making sense to me.
>>>>
>>>> Would be great to have some feedback from people such as @Wenchen Fan
>>>>  @Cheng Su  @angers zhu
>>>> .
>>>>
>>>>
>>>> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> +1 for this SPIP.
>>>>>
>>>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao 
>>>>> wrote:
>>>>>
>>>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>>>> making this more generalized.
>>>>>>
>>>>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>>>>>>
>>>>>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>>>>>> point!
>>>>>>>
>>>>>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>>>>>>>
>>>>>>>> +1 on this SPIP.
>>>>>>>>
>>>>>>>> This is a more generalized version of bucketed tables and bucketed
>>>>>>>> joins which can eliminate very expensive data shuffles when joins,
>>>>>>>> and
>>>>>>>> many users in the Apache Spark community have wanted this feature
>>>>>>>> for
>>>>>>>> a long time!
>>>>>>>>
>>>>>>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>>>>>>> it as a new feature in Spark 3.3
>>>>>>>>
>>>>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>>>>
>>>>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun 
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>>>>> storage partitioned join which covers bucket join support for 
>>>>>>>> DataSourceV2
>>>>>>>> but is more general. The goal is to let Spark leverage distribution
>>>>>>>> properties reported by data sources and eliminate shuffle whenever 
>>>>>>>> possible.
>>>>>>>> >
>>>>>>>> > Design doc:
>>>>>>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>>>>> (includes a POC link at the end)
>>>>>>>> >
>>>>>>>> > We'd like to start a discussion on the doc and any feedback is
>>>>>>>> welcome!
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Chao
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>
>
> --
> John Zhuge
>


-- 
Ryan Blue


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-24 Thread Ryan Blue
+1 from me as well. Thanks Chao for doing so much to get it to this point!

On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:

> +1 on this SPIP.
>
> This is a more generalized version of bucketed tables and bucketed
> joins which can eliminate very expensive data shuffles when joins, and
> many users in the Apache Spark community have wanted this feature for
> a long time!
>
> Thank you, Ryan and Chao, for working on this, and I look forward to
> it as a new feature in Spark 3.3
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
> >
> > Hi,
> >
> > Ryan and I drafted a design doc to support a new type of join: storage
> partitioned join which covers bucket join support for DataSourceV2 but is
> more general. The goal is to let Spark leverage distribution properties
> reported by data sources and eliminate shuffle whenever possible.
> >
> > Design doc:
> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> (includes a POC link at the end)
> >
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Chao
>


-- 
Ryan Blue


Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-24 Thread Ryan Blue
I don't think that it makes sense to discuss a different approach in the PR
rather than in the vote. Let's discuss this now since that's the purpose of
an SPIP.

On Mon, May 24, 2021 at 11:22 AM John Zhuge  wrote:

> Hi everyone, I’d like to start a vote for the ViewCatalog design proposal
> (SPIP).
>
> The proposal is to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
> update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>


-- 
Ryan Blue
Software Engineer
Netflix


[RESULT] [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
This SPIP is adopted with the following +1 votes and no -1 or +0 votes:

Holden Karau*
John Zhuge
Chao Sun
Dongjoon Hyun*
Russell Spitzer
DB Tsai*
Wenchen Fan*
Kent Yao
Huaxin Gao
Liang-Chi Hsieh
Jungtaek Lim
Hyukjin Kwon*
Gengliang Wang
kordex
Takeshi Yamamuro
Ryan Blue

* = binding

On Mon, Mar 8, 2021 at 3:55 PM Ryan Blue  wrote:

> Hi everyone, I’d like to start a vote for the FunctionCatalog design
> proposal (SPIP).
>
> The proposal is to add a FunctionCatalog interface that can be used to
> load and list functions for Spark to call. There are interfaces for scalar
> and aggregate functions.
>
> In the discussion we’ve come to consensus and I’ve updated the design doc
> to match how functions will be called:
>
> In addition to produceResult(InternalRow), which is optional, functions
> can define produceResult methods with arguments that are Spark’s internal
> data types, like UTF8String. Spark will prefer these methods when calling
> the UDF using codegen.
>
> I’ve also updated the AggregateFunction interface and merged it with the
> partial aggregate interface because Spark doesn’t support non-partial
> aggregates.
>
> The full SPIP doc is here:
> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit#heading=h.82w8qxfl2uwl
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll do
> a final update of the PR and we can merge the API.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
> --
> Ryan Blue
>


-- 
Ryan Blue


Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
And a late +1 from me.

On Fri, Mar 12, 2021 at 5:46 AM Takeshi Yamamuro 
wrote:

> +1, too.
>
> On Fri, Mar 12, 2021 at 8:51 PM kordex  wrote:
>
>> +1 (for what it's worth). It will definitely help our efforts.
>>
>> On Fri, Mar 12, 2021 at 12:14 PM Gengliang Wang  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > On Fri, Mar 12, 2021 at 3:00 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> +1
>> >>
>> >>> On Fri, Mar 12, 2021 at 2:54 PM, Jungtaek Lim wrote:
>> >>>
>> >>> +1 (non-binding) Excellent description on SPIP doc! Thanks for the
>> amazing effort!
>> >>>
>> >>> On Wed, Mar 10, 2021 at 3:19 AM Liang-Chi Hsieh 
>> wrote:
>> >>>>
>> >>>>
>> >>>> +1 (non-binding).
>> >>>>
>> >>>> Thanks for the work!
>> >>>>
>> >>>>
>> >>>> Erik Krogen wrote
>> >>>> > +1 from me (non-binding)
>> >>>> >
>> >>>> > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao 
>> >>>>
>> >>>> > huaxin.gao11@
>> >>>>
>> >>>> >  wrote:
>> >>>> >
>> >>>> >> +1 (non-binding)
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Sent from:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>> >>>>
>> >>>> -
>> >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


-- 
Ryan Blue
Software Engineer
Netflix


[VOTE] SPIP: Add FunctionCatalog

2021-03-08 Thread Ryan Blue
Hi everyone, I’d like to start a vote for the FunctionCatalog design
proposal (SPIP).

The proposal is to add a FunctionCatalog interface that can be used to load
and list functions for Spark to call. There are interfaces for scalar and
aggregate functions.

In the discussion we’ve come to consensus and I’ve updated the design doc
to match how functions will be called:

In addition to produceResult(InternalRow), which is optional, functions can
define produceResult methods with arguments that are Spark’s internal data
types, like UTF8String. Spark will prefer these methods when calling the
UDF using codegen.
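
For example, a string-length UDF under this proposal could look roughly like
the sketch below (names follow the SPIP text; the typed overload is the
"magic" method that codegen would prefer, and the exact shape is whatever the
final API defines):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
    import org.apache.spark.sql.types.{DataType, IntegerType, StringType}
    import org.apache.spark.unsafe.types.UTF8String

    class StrLen extends ScalarFunction[Integer] {
      override def inputTypes(): Array[DataType] = Array(StringType)
      override def resultType(): DataType = IntegerType
      override def name(): String = "strlen"

      // Row-based entry point, always available as a fallback.
      override def produceResult(input: InternalRow): Integer =
        Integer.valueOf(input.getUTF8String(0).numChars())

      // Typed variant using Spark's internal string type, preferred by codegen.
      def produceResult(str: UTF8String): Integer =
        Integer.valueOf(str.numChars())
    }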

I’ve also updated the AggregateFunction interface and merged it with the
partial aggregate interface because Spark doesn’t support non-partial
aggregates.

The full SPIP doc is here:
https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit#heading=h.82w8qxfl2uwl

Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll do
a final update of the PR and we can merge the API.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …
-- 
Ryan Blue


Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Ryan Blue
Okay, great. I'll update the SPIP doc and call a vote in the next day or
two.

On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen  wrote:

> +1 on Dongjoon's proposal. This is a very nice compromise between the
> reflective/magic-method approach and the InternalRow approach, providing
> a lot of flexibility for our users, and allowing for the more complicated
> reflection-based approach to evolve at its own pace, since you can always
> fall back to InternalRow for situations which aren't yet supported by
> reflection.
>
> We can even consider having Spark code detect that you haven't overridden
> the default produceResult (IIRC this is discoverable via reflection), and
> raise an error at query analysis time instead of at runtime when it can't
> find a reflective method or an overridden produceResult.
>
> I'm very pleased we have found a compromise that everyone seems happy
> with! Big thanks to everyone who participated.
>
> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge  wrote:
>
>> +1 Good plan to move forward.
>>
>> Thank you all for the constructive and comprehensive discussions in this
>> thread! Decisions on this important feature will have ramifications for
>> years to come.
>>
>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan  wrote:
>>
>>> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
>>> variants and prefer the "magical methods", then we can have a single
>>> ScalarFunction interface which has the row-parameter API (with a
>>> default implementation to fail) and documents to describe the "magical
>>> methods" (which can be done later).
>>>
>>> I'll start the PR review this week to check the naming, doc, etc.
>>>
>>> Thanks all for the discussion here and let's move forward!
>>>
>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue  wrote:
>>>
>>>> Good point, Dongjoon. I think we can probably come to some compromise
>>>> here:
>>>>
>>>>- Remove SupportsInvoke since it isn’t really needed. We should
>>>>always try to find the right method to invoke in the codegen path.
>>>>- Add a default implementation of produceResult so that
>>>>implementations don’t have to use it. If they don’t implement it and a
>>>>magic function can’t be found, then it will throw
>>>>UnsupportedOperationException
>>>>
>>>> This is assuming that we can agree not to introduce all of the
>>>> ScalarFunction interface variations, which would have limited utility
>>>> because of type erasure.
>>>>
>>>> Does that sound like a good plan to everyone? If so, I’ll update the
>>>> SPIP doc so we can move forward.
>>>>
>>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> We shared many opinions in different perspectives.
>>>>> However, we didn't reach a consensus even on a partial merge by
>>>>> excluding something
>>>>> (on the PR by me, on this mailing thread by Wenchen).
>>>>>
>>>>> For the following claims, we have another alternative to mitigate it.
>>>>>
>>>>> > I don't like it because it promotes the row-parameter API and
>>>>> forces users to implement it, even if the users want to only use the
>>>>> individual-parameters API.
>>>>>
>>>>> Why don't we merge the AS-IS PR by adding something instead of
>>>>> excluding something?
>>>>>
>>>>> - R produceResult(InternalRow input);
>>>>> + default R produceResult(InternalRow input) throws Exception {
>>>>> +   throw new UnsupportedOperationException();
>>>>> + }
>>>>>
>>>>> By providing the default implementation, it will not *force users to
>>>>> implement it* technically.
>>>>> And, we can provide a document about our expected usage properly.
>>>>> What do you think?
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>>>>>
>>>>>> Yes, GenericInternalRow is safe when types mismatch, with the
>>>>>> cost of using Object[], and primitive types need to do boxing
>>>>>>
>>>>>> The question is not whether to use the magi

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
Good point, Dongjoon. I think we can probably come to some compromise here:

   - Remove SupportsInvoke since it isn’t really needed. We should always
   try to find the right method to invoke in the codegen path.
   - Add a default implementation of produceResult so that implementations
   don’t have to use it. If they don’t implement it and a magic function can’t
   be found, then it will throw UnsupportedOperationException

This is assuming that we can agree not to introduce all of the
ScalarFunction interface variations, which would have limited utility
because of type erasure.

Does that sound like a good plan to everyone? If so, I’ll update the SPIP
doc so we can move forward.
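
To illustrate the compromise in code, the interface shape would be roughly
the following sketch (names illustrative, not the final API):

import org.apache.spark.sql.catalyst.InternalRow;

// produceResult(InternalRow) gets a default implementation that throws, so a
// UDF can rely solely on a type-specific "magic" method found via reflection.
interface ScalarFunction<R> {
  default R produceResult(InternalRow input) {
    throw new UnsupportedOperationException(
        "Implement produceResult(InternalRow) or a type-specific produceResult method");
  }
}

// Implementation that only provides the magic method; the InternalRow path
// throws only if Spark cannot find a matching method for the bound input types.
class DoubleAdd implements ScalarFunction<Double> {
  public double produceResult(double left, double right) {
    return left + right;
  }
}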

On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> We shared many opinions in different perspectives.
> However, we didn't reach a consensus even on a partial merge by excluding
> something
> (on the PR by me, on this mailing thread by Wenchen).
>
> For the following claims, we have another alternative to mitigate it.
>
> > I don't like it because it promotes the row-parameter API and forces
> users to implement it, even if the users want to only use the
> individual-parameters API.
>
> Why don't we merge the AS-IS PR by adding something instead of excluding
> something?
>
> - R produceResult(InternalRow input);
> + default R produceResult(InternalRow input) throws Exception {
> +   throw new UnsupportedOperationException();
> + }
>
> By providing the default implementation, it will not *force users to
> implement it* technically.
> And, we can provide a document about our expected usage properly.
> What do you think?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue  wrote:
>
>> Yes, GenericInternalRow is safe when types mismatch, with the cost of
>> using Object[], and primitive types need to do boxing
>>
>> The question is not whether to use the magic functions, which would not
>> need boxing. The question here is whether to use multiple ScalarFunction
>> interfaces. Those interfaces would require boxing or using Object[] so
>> there isn’t a benefit.
>>
>> If we do want to reuse one UDF for different types, using “magical
>> methods” solves the problem
>>
>> Yes, that’s correct. We agree that magic methods are a good option for
>> this.
>>
>> Again, the question we need to decide is whether to use InternalRow or
>> interfaces like ScalarFunction2 for non-codegen. The option to use
>> multiple interfaces is limited by type erasure because you can only have
>> one set of type parameters. If you wanted to support both 
>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long> you’d have to fall back to
>> ScalarFunction2<Object, Object> and cast.
>>
>> The point is that type erasure will commonly lead either to many
>> different implementation classes (one for each type combination) or will
>> lead to parameterizing by Object, which defeats the purpose.
>>
>> The alternative adds safety because correct types are produced by calls
>> like getLong(0). Yes, this depends on the implementation making the
>> correct calls, but it is better than using Object and casting.
>>
>> I can’t think of real use cases that will force the individual-parameters
>> approach to use Object instead of concrete types.
>>
>> I think this is addressed by the type erasure discussion above. A simple
>> Plus method would require Object or 12 implementations for 2 arguments
>> and 4 numeric types.
>>
>> And basically all varargs cases would need to use Object[]. Consider a
>> UDF to create a map that requires string keys and some consistent type for
>> values. This would be easy with the InternalRow API because you can use
>> getString(pos) and get(pos + 1, valueType) to get the key/value pairs.
>> Use of UTF8String vs String will be checked at compile time.
>>
>> I agree that Object[] is worse than InternalRow
>>
>> Yes, and if we are always using Object because of type erasure or using
>> magic methods to get better performance, the utility of the parameterized
>> interfaces is very limited.
>>
>> Because we want to expose the magic functions, the use of ScalarFunction2
>> and similar is extremely limited because it is only for non-codegen.
>> Varargs is by far the more common case. The InternalRow interface is
>> also a very simple way to get started and ensures that Spark can always
>> find the right method after the function is bound to input types.
>>
>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:
>>
>>> Yes, GenericInternalRow is safe if w

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
Yes, GenericInternalRow is safe when types mismatch, with the cost of
using Object[], and primitive types need to do boxing

The question is not whether to use the magic functions, which would not
need boxing. The question here is whether to use multiple ScalarFunction
interfaces. Those interfaces would require boxing or using Object[] so
there isn’t a benefit.

If we do want to reuse one UDF for different types, using “magical methods”
solves the problem

Yes, that’s correct. We agree that magic methods are a good option for this.

Again, the question we need to decide is whether to use InternalRow or
interfaces like ScalarFunction2 for non-codegen. The option to use multiple
interfaces is limited by type erasure because you can only have one set of
type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and
ScalarFunction2<Long, Long> you’d have to fall back to ScalarFunction2<Object, Object> and cast.

The point is that type erasure will commonly lead either to many different
implementation classes (one for each type combination) or will lead to
parameterizing by Object, which defeats the purpose.
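
A short, self-contained illustration of that erasure limitation (the
interface here is a stand-in for the proposed ScalarFunction2):

// A hypothetical two-argument function interface.
interface ScalarFunction2<T1, T2, R> {
  R produceResult(T1 left, T2 right);
}

// Not allowed -- javac rejects implementing the same generic interface twice:
//   class Plus implements ScalarFunction2<Integer, Integer, Integer>,
//                         ScalarFunction2<Long, Long, Long> { ... }

// So a single implementation falls back to Object and casting:
class Plus implements ScalarFunction2<Object, Object, Object> {
  @Override
  public Object produceResult(Object left, Object right) {
    if (left instanceof Integer) {
      return (Integer) left + (Integer) right;
    }
    return (Long) left + (Long) right;
  }
}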

The alternative adds safety because correct types are produced by calls
like getLong(0). Yes, this depends on the implementation making the correct
calls, but it is better than using Object and casting.

I can’t think of real use cases that will force the individual-parameters
approach to use Object instead of concrete types.

I think this is addressed by the type erasure discussion above. A simple
Plus method would require Object or 12 implementations for 2 arguments and
4 numeric types.

And basically all varargs cases would need to use Object[]. Consider a UDF
to create a map that requires string keys and some consistent type for
values. This would be easy with the InternalRow API because you can use
getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use
of UTF8String vs String will be checked at compile time.
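
Here is a sketch of that map-building case with the row-based API. A real
implementation would return Spark's MapData; a plain Java map is used to keep
the sketch short, and valueType is assumed to come from the bind step.

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.DataType;

// Varargs-style UDF: arguments arrive as alternating key/value pairs.
class MakeMap {
  private final DataType valueType;

  MakeMap(DataType valueType) {
    this.valueType = valueType;
  }

  Map<String, Object> produceResult(InternalRow args) {
    Map<String, Object> result = new LinkedHashMap<>();
    for (int pos = 0; pos < args.numFields(); pos += 2) {
      String key = args.getString(pos);            // keys must be strings
      Object value = args.get(pos + 1, valueType); // values share one bound type
      result.put(key, value);
    }
    return result;
  }
}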

I agree that Object[] is worse than InternalRow

Yes, and if we are always using Object because of type erasure or using
magic methods to get better performance, the utility of the parameterized
interfaces is very limited.

Because we want to expose the magic functions, the use of ScalarFunction2
and similar is extremely limited because it is only for non-codegen.
Varargs is by far the more common case. The InternalRow interface is also a
very simple way to get started and ensures that Spark can always find the
right method after the function is bound to input types.

On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan  wrote:

> Yes, GenericInternalRow is safe when types mismatch, with the cost of
> using Object[], and primitive types need to do boxing. And this is a
> runtime failure, which is absolutely worse than query-compile-time checks.
> Also, don't forget my previous point: users need to specify the type and
> index such as row.getLong(0), which is error-prone.
>
> > But we don’t do that for any of the similar UDFs today so I’m skeptical
> that this would actually be a high enough priority to implement.
>
> I'd say this is a must-have if we go with the individual-parameters
> approach. The Scala UDF today checks the method signature at compile-time,
> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
> to use.
>
> > You can’t implement ScalarFunction2<Integer, Integer> and
> ScalarFunction2<Long, Long>.
>
> Can you elaborate? We have function binding and we can use different UDFs
> for different input types. If we do want to reuse one UDF
> for different types, using "magical methods" solves the problem:
> class MyUDF {
>   def call(i: Int): Int = ...
>   def call(l: Long): Long = ...
> }
>
> On the other side, I don't think the row-parameter approach can solve this
> problem. The best I can think of is:
> class MyUDF implement ScalaFunction[Object] {
>   def call(row: InternalRow): Object = {
> if (int input) row.getInt(0) ... else row.getLong(0) ...
>   }
> }
>
> This is worse because: 1) it needs to do if-else to check different input
> types. 2) the return type can only be Object and cause boxing issues.
>
> I agree that Object[] is worse than InternalRow. But I can't think of
> real use cases that will force the individual-parameters approach to use
> Object instead of concrete types.
>
>
> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue  wrote:
>
>> Thanks for adding your perspective, Erik!
>>
>> If the input is string type but the UDF implementation calls
>> row.getLong(0), it returns wrong data
>>
>> I think this is misleading. It is true for UnsafeRow, but there is no
>> reason why InternalRow should return incorrect values.
>>
>> The implementation in GenericInternalRow
>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressi

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-01 Thread Ryan Blue
Thanks for adding your perspective, Erik!

If the input is string type but the UDF implementation calls row.getLong(0),
it returns wrong data

I think this is misleading. It is true for UnsafeRow, but there is no
reason why InternalRow should return incorrect values.

The implementation in GenericInternalRow
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
would throw a ClassCastException. I don’t think that using a row is a bad
option simply because UnsafeRow is unsafe.

It’s unlikely that UnsafeRow would be used to pass the data. The
implementation would evaluate each argument expression and set the result
in a generic row, then pass that row to the UDF. We can use whatever
implementation we choose to provide better guarantees than unsafe.
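
As a quick illustration of that difference, using the existing
GenericInternalRow class:

import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.unsafe.types.UTF8String;

// A generic row fails loudly on a type mismatch instead of silently
// reinterpreting bytes the way UnsafeRow.getLong would.
public class RowSafetyExample {
  public static void main(String[] args) {
    GenericInternalRow row =
        new GenericInternalRow(new Object[] { UTF8String.fromString("not a long") });
    try {
      row.getLong(0); // wrong accessor for a string value
    } catch (ClassCastException e) {
      System.out.println("type mismatch detected: " + e);
    }
  }
}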

I think we should consider query-compile-time checks as nearly-as-good as
Java-compile-time checks for the purposes of safety.

I don’t think I agree with this. A failure at query analysis time vs
runtime still requires going back to a separate project, fixing something,
and rebuilding. The time needed to fix a problem goes up significantly vs.
compile-time checks. And that is even worse if the UDF is maintained by
someone else.

I think we also need to consider how common it would be that a use case can
have the query-compile-time checks. Going through this in more detail below
makes me think that it is unlikely that these checks would be used often
because of the limitations of using an interface with type erasure.

I believe that Wenchen’s proposal will provide stronger query-compile-time
safety

The proposal could have better safety for each argument, assuming that we
detect failures by looking at the parameter types using reflection in the
analyzer. But we don’t do that for any of the similar UDFs today so I’m
skeptical that this would actually be a high enough priority to implement.

As Erik pointed out, type erasure also limits the effectiveness. You can’t
implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>.
You can handle those cases using InternalRow or you can handle them using
VarargScalarFunction. That forces many use cases into varargs with
Object, where you don’t get any of the proposed analyzer benefits and lose
compile-time checks. The only time the additional checks (if implemented)
would help is when only one set of argument types is needed because
implementing ScalarFunction defeats the purpose.

It’s worth noting that safety for the magic methods would be identical
between the two options, so the trade-off to consider is for varargs and
non-codegen cases. Combining the limitations discussed, this has better
safety guarantees only if you need just one set of types for each number of
arguments and are using the non-codegen path. Since varargs is one of the
primary reasons to use this API, then I don’t think that it is a good idea
to use Object[] instead of InternalRow.
-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-19 Thread Ryan Blue
I don’t see any benefit to more complexity with 22 additional interfaces,
instead of simply passing an InternalRow. Why not use a single interface
with InternalRow? Maybe you could share your motivation?

That would also result in strange duplication, where the ScalarFunction2
method is just the boxed version:

class DoubleAdd implements ScalarFunction2<Double, Double, Double> {
  @Override
  Double produceResult(Double left, Double right) {
return left + right;
  }

  double produceResult(double left, double right) {
return left + right;
  }
}

This would work okay, but would be awkward if you wanted to use the same
implementation for any number of arguments, like a sum method that adds all
of the arguments together and returns the result. It also isn’t great for
varargs, since it is basically the same as the invoke case.
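
For that sum case, the row-based form covers any arity with one method; a
minimal sketch:

import org.apache.spark.sql.catalyst.InternalRow;

// One method handles any number of arguments; the boxed-interface approach
// would need a separate interface (and implementation) per arity.
class SumAll {
  Double produceResult(InternalRow args) {
    double total = 0.0;
    for (int i = 0; i < args.numFields(); i++) {
      total += args.getDouble(i);
    }
    return total;
  }
}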

The combination of an InternalRow method and the invoke method seems to be
a good way to handle this to me. What is wrong with it? And, how would you
solve the problem when implementations define methods with the wrong types?
The InternalRow approach helps implementations catch that problem (as
demonstrated above) and also provides a fallback when there is a bug
preventing the invoke optimization from working. That seems like a good
approach to me.

On Thu, Feb 18, 2021 at 11:31 PM Wenchen Fan  wrote:

> If people have such a big concern about reflection, we can follow the current
> Spark Java UDF
> <https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java>
> and Transport
> <https://github.com/linkedin/transport/tree/master/transportable-udfs-api/src/main/java/com/linkedin/transport/api/udf>,
> and create ScalarFunction0[R], ScalarFunction1[T1, R], etc. to avoid
> reflection. But we may need to investigate how to avoid boxing with this
> API design.
>
> To put a detailed proposal: let's have ScalarFunction0, ScalarFunction1,
> ..., ScalarFunction9 and VarargsScalarFunction. At execution time, if
> Spark sees ScalarFunction0-9, pass the input columns to the UDF directly,
> one column one parameter. So string type input is UTF8String, array type
> input is ArrayData. If Spark sees VarargsScalarFunction, wrap the input
> columns with InternalRow and pass it to the UDF.
>
> In general, if VarargsScalarFunction is implemented, the UDF should not
> implement ScalarFunction0-9. We can also define a priority order to allow
> this. I don't have a strong preference here.
>
> What do you think?
>
> On Fri, Feb 19, 2021 at 1:24 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> I agree with Ryan on the questions around the expressivity of the Invoke
>> method. It is not clear to me how the Invoke method can be used to declare
>> UDFs with type-parameterized parameters. For example: a UDF to get the Nth
>> element of an array (regardless of the Array element type) or a UDF to
>> merge two Arrays (of generic types) to a Map.
>>
>> Also, to address Wenchen's InternalRow question, can we create a number
>> of Function classes, each corresponding to a number of input parameter
>> length (e.g., ScalarFunction1, ScalarFunction2, etc)?
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Thu, Feb 18, 2021 at 6:07 PM Ryan Blue 
>> wrote:
>>
>>> I agree with you that it is better in many cases to directly call a
>>> method. But it is not better in all cases, which is why I don’t think it is
>>> the right general-purpose choice.
>>>
>>> First, if codegen isn’t used for some reason, the reflection overhead is
>>> really significant. That gets much better when you have an interface to
>>> call. That’s one reason I’d use this pattern:
>>>
>>> class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
>>>   Double produceResult(InternalRow row) {
>>> return produceResult(row.getDouble(0), row.getDouble(1));
>>>   }
>>>
>>>   double produceResult(double left, double right) {
>>> return left + right;
>>>   }
>>> }
>>>
>>> There’s little overhead to adding the InternalRow variation, but we
>>> could call it in eval to avoid the reflect overhead. To the point about
>>> UDF developers, I think this is a reasonable cost.
>>>
>>> Second, I think usability is better and helps avoid runtime issues.
>>> Here’s an example:
>>>
>>> class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
>>>   Integer produceResult(InternalRow row) {
>>> return produceResult(row.getString(0));
>>>   }
>>>
>>>   Integer produceResult(String str) {
>>> return str.length();
>>>   }
>>> }
>>>
>>> See t

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Ryan Blue
I agree with you that it is better in many cases to directly call a method.
But it is not better in all cases, which is why I don’t think it is the
right general-purpose choice.

First, if codegen isn’t used for some reason, the reflection overhead is
really significant. That gets much better when you have an interface to
call. That’s one reason I’d use this pattern:

class DoubleAdd implements ScalarFunction<Double>, SupportsInvoke {
  Double produceResult(InternalRow row) {
return produceResult(row.getDouble(0), row.getDouble(1));
  }

  double produceResult(double left, double right) {
return left + right;
  }
}

There’s little overhead to adding the InternalRow variation, but we could
call it in eval to avoid the reflection overhead. To the point about UDF
developers, I think this is a reasonable cost.

Second, I think usability is better and helps avoid runtime issues. Here’s
an example:

class StrLen implements ScalarFunction<Integer>, SupportsInvoke {
  Integer produceResult(InternalRow row) {
return produceResult(row.getString(0));
  }

  Integer produceResult(String str) {
return str.length();
  }
}

See the bug? I forgot to use UTF8String instead of String. With the
InternalRow method, I get a compiler warning because getString produces
UTF8String that can’t be passed to produceResult(String). If I decided to
implement length separately, then we could still run the InternalRow
version and log a warning. The code would be slightly slower, but wouldn’t
fail.

There are similar situations with varargs where it’s better to call methods
that produce concrete types than to cast from Object to some expected type.

I think that using invoke is a great extension to the proposal, but I don’t
think that it should be the only way to call functions. By all means, let’s
work on it in parallel and use it where possible. But I think we do need
the fallback of using InternalRow and that it isn’t a usability problem to
include it.

Oh, and one last thought is that we already have users that call
Dataset.map and use InternalRow. This would allow converting that code
directly to a UDF.

I think we’re closer to agreeing here than it actually looks. Hopefully
you’ll agree that having the InternalRow method isn’t a big usability
problem.

On Wed, Feb 17, 2021 at 11:51 PM Wenchen Fan  wrote:

> I don't see any objections to the rest of the proposal (loading functions
> from the catalog, function binding stuff, etc.) and I assume everyone is OK
> with it. We can commit that part first.
>
> Currently, the discussion focuses on the `ScalarFunction` API, where I
> think it's better to directly take the input columns as the UDF parameter,
> instead of wrapping the input columns with InternalRow and taking the
> InternalRow as the UDF parameter. It's not only for better performance,
> but also for ease of use. For example, it's easier for the UDF developer to
> write `input1 + input2` than `inputRow.getLong(0) + inputRow.getLong(1)`,
> as they don't need to specify the type and index by themselves (getLong(0))
> which is error-prone.
>
> It does push more work to the Spark side, but I think it's worth it if
> implementing UDF gets easier. I don't think the work is very challenging,
> as we can leverage the infra we built for the expression encoder.
>
> I think it's also important to look at the UDF API from the user's
> perspective (UDF developers). How do you like the UDF API without
> considering how Spark can support it? Do you prefer the
> individual-parameters version or the row-parameter version?
>
> To move forward, how about we implement the function loading and binding
> first? Then we can have PRs for both the individual-parameters (I can take
> it) and row-parameter approaches, if we still can't reach a consensus at
> that time and need to see all the details.
>
> On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue 
> wrote:
>
>> Thanks, Hyukjin. I think that's a fair summary. And I agree with the idea
>> that we should focus on what Spark will do by default.
>>
>> I think we should focus on the proposal, for two reasons: first, there is
>> a straightforward path to incorporate Wenchen's suggestion via
>> `SupportsInvoke`, and second, the proposal is more complete: it defines a
>> solution for many concerns like loading a function and finding out what
>> types to use -- not just how to call code -- and supports more use cases
>> like varargs functions. I think we can continue to discuss the rest of the
>> proposal and be confident that we can support an invoke code path where it
>> makes sense.
>>
>> Does everyone agree? If not, I think we would need to solve a lot of the
>> challenges that I initially brought up with the invoke idea. It seems like
>> a good way to call a function, but needs a real proposal behind it if we
>> don't use i

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Ryan Blue
Thanks, Hyukjin. I think that's a fair summary. And I agree with the idea
that we should focus on what Spark will do by default.

I think we should focus on the proposal, for two reasons: first, there is a
straightforward path to incorporate Wenchen's suggestion via
`SupportsInvoke`, and second, the proposal is more complete: it defines a
solution for many concerns like loading a function and finding out what
types to use -- not just how to call code -- and supports more use cases
like varargs functions. I think we can continue to discuss the rest of the
proposal and be confident that we can support an invoke code path where it
makes sense.

Does everyone agree? If not, I think we would need to solve a lot of the
challenges that I initially brought up with the invoke idea. It seems like
a good way to call a function, but needs a real proposal behind it if we
don't use it via `SupportsInvoke` in the current proposal.

On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon  wrote:

> Just to make sure we don’t move past, I think we haven’t decided yet:
>
>- if we’ll replace the current proposal to Wenchen’s approach as the
>default
>- if we want to have Wenchen’s approach as an optional mix-in on the
>top of Ryan’s proposal (SupportsInvoke)
>
> From what I read, some people pointed out it as a replacement. Please
> correct me if I misread this discussion thread.
> As Dongjoon pointed out, it would be good to know rough ETA to make sure
> making progress in this, and people can compare more easily.
>
>
> FWIW, there’s the saying I like in the zen of Python
> <https://www.python.org/dev/peps/pep-0020/>:
>
> There should be one— and preferably only one —obvious way to do it.
>
> If multiple approaches have the way for developers to do the (almost) same
> thing, I would prefer to avoid it.
>
> In addition, I would prefer to focus on what Spark does by default first.
>
>
> On Wed, Feb 17, 2021 at 2:33 PM Dongjoon Hyun wrote:
>
>> Hi, Wenchen.
>>
>> This thread seems to get enough attention. Also, I'm expecting more and
>> more if we have this on the `master` branch because we are developing
>> together.
>>
>> > Spark SQL has many active contributors/committers and this thread
>> doesn't get much attention yet.
>>
>> So, what's your ETA from now?
>>
>> > I think the problem here is we were discussing some very detailed
>> things without actual code.
>> > I'll implement my idea after the holiday and then we can have more
>> effective discussions.
>> > We can also do benchmarks and get some real numbers.
>> > In the meantime, we can continue to discuss other parts of this
>> proposal, and make a prototype if possible.
>>
>> I'm looking forward to seeing your PR. I hope we can conclude this thread
>> and have at least one implementation in the `master` branch this month
>> (February).
>> If you need more time (one month or longer), why don't we have Ryan's
>> suggestion in the `master` branch first and benchmark with your PR later
>> during Apache Spark 3.2 timeframe.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue 
>> wrote:
>>
>>> Andrew,
>>>
>>> The proposal already includes an API for aggregate functions and I think
>>> we would want to implement those right away.
>>>
>>> Processing ColumnBatch is something we can easily extend the interfaces
>>> to support, similar to Wenchen's suggestion. The important thing right now
>>> is to agree on some basic functionality: how to look up functions and what
>>> the simple API should be. Like the TableCatalog interfaces, we will layer
>>> on more support through optional interfaces like `SupportsInvoke` or
>>> `SupportsColumnBatch`.
>>>
>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo 
>>> wrote:
>>>
>>>> Hello Ryan,
>>>>
>>>> This proposal looks very interesting. Would future goals for this
>>>> functionality include both support for aggregation functions, as well
>>>> as support for processing ColumnBatch-es (instead of Row/InternalRow)?
>>>>
>>>> Thanks
>>>> Andrew
>>>>
>>>> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue 
>>>> wrote:
>>>> >
>>>> > Thanks for the positive feedback, everyone. It sounds like there is a
>>>> clear path forward for calling functions. Even without a prototype, the
>>>> `invoke` plans show that Wenchen's suggested optimization can be done, and
>>>> incorporating it as an optional 

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Ryan Blue
Andrew,

The proposal already includes an API for aggregate functions and I think we
would want to implement those right away.

Processing ColumnBatch is something we can easily extend the interfaces to
support, similar to Wenchen's suggestion. The important thing right now is
to agree on some basic functionality: how to look up functions and what the
simple API should be. Like the TableCatalog interfaces, we will layer on
more support through optional interfaces like `SupportsInvoke` or
`SupportsColumnBatch`.

On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo  wrote:

> Hello Ryan,
>
> This proposal looks very interesting. Would future goals for this
> functionality include both support for aggregation functions, as well
> as support for processing ColumnBatch-es (instead of Row/InternalRow)?
>
> Thanks
> Andrew
>
> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue 
> wrote:
> >
> > Thanks for the positive feedback, everyone. It sounds like there is a
> clear path forward for calling functions. Even without a prototype, the
> `invoke` plans show that Wenchen's suggested optimization can be done, and
> incorporating it as an optional extension to this proposal solves many of
> the unknowns.
> >
> > With that area now understood, is there any discussion about other parts
> of the proposal, besides the function call interface?
> >
> > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun  wrote:
> >>
> >> This is an important feature which can unblock several other projects
> including bucket join support for DataSource v2, complete support for
> enforcing DataSource v2 distribution requirements on the write path, etc. I
> like Ryan's proposals which look simple and elegant, with nice support on
> function overloading and variadic arguments. On the other hand, I think
> Wenchen made a very good point about performance. Overall, I'm excited to
> see active discussions on this topic and believe the community will come to
> a proposal with the best of both sides.
> >>
> >> Chao
> >>
> >> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon 
> wrote:
> >>>
> >>> +1 for Liang-chi's.
> >>>
> >>> Thanks Ryan and Wenchen for leading this.
> >>>
> >>>
> >>> On Sat, Feb 13, 2021 at 12:18 PM Liang-Chi Hsieh wrote:
> >>>>
> >>>> Basically I think the proposal makes sense to me and I'd like to
> support the
> >>>> SPIP as it looks like we have strong need for the important feature.
> >>>>
> >>>> Thanks Ryan for working on this and I do also look forward to
> Wenchen's
> >>>> implementation. Thanks for the discussion too.
> >>>>
> >>>> Actually I think the SupportsInvoke proposed by Ryan looks a good
> >>>> alternative to me. Besides Wenchen's alternative implementation, is
> there a
> >>>> chance we also have the SupportsInvoke for comparison?
> >>>>
> >>>>
> >>>> John Zhuge wrote
> >>>> > Excited to see our Spark community rallying behind this important
> feature!
> >>>> >
> >>>> > The proposal lays a solid foundation of minimal feature set with
> careful
> >>>> > considerations for future optimizations and extensions. Can't wait
> to see
> >>>> > it leading to more advanced functionalities like views with shared
> custom
> >>>> > functions, function pushdown, lambda, etc. It has already borne
> fruit from
> >>>> > the constructive collaborations in this thread. Looking forward to
> >>>> > Wenchen's prototype and further discussions including the
> SupportsInvoke
> >>>> > extension proposed by Ryan.
> >>>> >
> >>>> >
> >>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
> >>>>
> >>>> > owen.omalley@
> >>>>
> >>>> > 
> >>>> > wrote:
> >>>> >
> >>>> >> I think this proposal is a very good thing giving Spark a standard
> way of
> >>>> >> getting to and calling UDFs.
> >>>> >>
> >>>> >> I like having the ScalarFunction as the API to call the UDFs. It is
> >>>> >> simple, yet covers all of the polymorphic type cases well. I think
> it
> >>>> >> would
> >>>> >> also simplify using the functions in other contexts like pushing
> down
> >>>> >> filters into the ORC & Parquet readers although there are a

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ryan Blue
Thanks for the positive feedback, everyone. It sounds like there is a clear
path forward for calling functions. Even without a prototype, the `invoke`
plans show that Wenchen's suggested optimization can be done, and
incorporating it as an optional extension to this proposal solves many of
the unknowns.

With that area now understood, is there any discussion about other parts of
the proposal, besides the function call interface?

On Fri, Feb 12, 2021 at 10:40 PM Chao Sun  wrote:

> This is an important feature which can unblock several other projects
> including bucket join support for DataSource v2, complete support for
> enforcing DataSource v2 distribution requirements on the write path, etc. I
> like Ryan's proposals which look simple and elegant, with nice support on
> function overloading and variadic arguments. On the other hand, I think
> Wenchen made a very good point about performance. Overall, I'm excited to
> see active discussions on this topic and believe the community will come to
> a proposal with the best of both sides.
>
> Chao
>
> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon  wrote:
>
>> +1 for Liang-chi's.
>>
>> Thanks Ryan and Wenchen for leading this.
>>
>>
>> On Sat, Feb 13, 2021 at 12:18 PM Liang-Chi Hsieh wrote:
>>
>>> Basically I think the proposal makes sense to me and I'd like to support
>>> the
>>> SPIP as it looks like we have strong need for the important feature.
>>>
>>> Thanks Ryan for working on this and I do also look forward to Wenchen's
>>> implementation. Thanks for the discussion too.
>>>
>>> Actually I think the SupportsInvoke proposed by Ryan looks a good
>>> alternative to me. Besides Wenchen's alternative implementation, is
>>> there a
>>> chance we also have the SupportsInvoke for comparison?
>>>
>>>
>>> John Zhuge wrote
>>> > Excited to see our Spark community rallying behind this important
>>> feature!
>>> >
>>> > The proposal lays a solid foundation of minimal feature set with
>>> careful
>>> > considerations for future optimizations and extensions. Can't wait to
>>> see
>>> > it leading to more advanced functionalities like views with shared
>>> custom
>>> > functions, function pushdown, lambda, etc. It has already borne fruit
>>> from
>>> > the constructive collaborations in this thread. Looking forward to
>>> > Wenchen's prototype and further discussions including the
>>> SupportsInvoke
>>> > extension proposed by Ryan.
>>> >
>>> >
>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley 
>>>
>>> > owen.omalley@
>>>
>>> > 
>>> > wrote:
>>> >
>>> >> I think this proposal is a very good thing giving Spark a standard
>>> way of
>>> >> getting to and calling UDFs.
>>> >>
>>> >> I like having the ScalarFunction as the API to call the UDFs. It is
>>> >> simple, yet covers all of the polymorphic type cases well. I think it
>>> >> would
>>> >> also simplify using the functions in other contexts like pushing down
>>> >> filters into the ORC & Parquet readers although there are a lot of
>>> >> details
>>> >> that would need to be considered there.
>>> >>
>>> >> .. Owen
>>> >>
>>> >>
>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen 
>>>
>>> > ekrogen@.com
>>>
>>> > 
>>> >> wrote:
>>> >>
>>> >>> I agree that there is a strong need for a FunctionCatalog within
>>> Spark
>>> >>> to
>>> >>> provide support for shareable UDFs, as well as make movement towards
>>> >>> more
>>> >>> advanced functionality like views which themselves depend on UDFs,
>>> so I
>>> >>> support this SPIP wholeheartedly.
>>> >>>
>>> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>>> >>> and
>>> >>> extensible. I generally think Wenchen's proposal is easier for a
>>> user to
>>> >>> work with in the common case, but has greater potential for confusing
>>> >>> and
>>> >>> hard-to-debug behavior due to use of reflective method signature
>>> >>> searches.
>>> >>> The merits on both sides can hopefully be more properly exa

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-10 Thread Ryan Blue
due to trying to come up with a
>>>> better design. We already have an ugly picture of the current Spark APIs to
>>>> draw a better bigger picture.
>>>>
>>>>
>>>> On Wed, Feb 10, 2021 at 3:28 AM Holden Karau wrote:
>>>>
>>>>> I think this proposal is a good set of trade-offs and has existed in
>>>>> the community for a long period of time. I especially appreciate how the
>>>>> design is focused on a minimal useful component, with future optimizations
>>>>> considered from a point of view of making sure it's flexible, but actual
>>>>> concrete decisions left for the future once we see how this API is used. I
>>>>> think if we try and optimize everything right out of the gate, we'll
>>>>> quickly get stuck (again) and not make any progress.
>>>>>
>>>>> On Mon, Feb 8, 2021 at 10:46 AM Ryan Blue  wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'd like to start a discussion for adding a FunctionCatalog interface
>>>>>> to catalog plugins. This will allow catalogs to expose functions to 
>>>>>> Spark,
>>>>>> similar to how the TableCatalog interface allows a catalog to expose
>>>>>> tables. The proposal doc is available here:
>>>>>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>>>>>
>>>>>> Here's a high-level summary of some of the main design choices:
>>>>>> * Adds the ability to list and load functions, not to create or
>>>>>> modify them in an external catalog
>>>>>> * Supports scalar, aggregate, and partial aggregate functions
>>>>>> * Uses load and bind steps for better error messages and simpler
>>>>>> implementations
>>>>>> * Like the DSv2 table read and write APIs, it uses InternalRow to
>>>>>> pass data
>>>>>> * Can be extended using mix-in interfaces to add vectorization,
>>>>>> codegen, and other future features
>>>>>>
>>>>>> There is also a PR with the proposed API:
>>>>>> https://github.com/apache/spark/pull/24559/files
>>>>>>
>>>>>> Let's discuss the proposal here rather than on that PR, to get better
>>>>>> visibility. Also, please take the time to read the proposal first. That
>>>>>> really helps clear up misconceptions.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>> --
Ryan Blue


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Ryan Blue
I don’t think that using Invoke really works. The usability is poor if
something goes wrong and it can’t handle varargs or parameterized use cases
well. There isn’t a significant enough performance penalty for passing data
as a row to justify making the API much more difficult and less expressive.
I don’t think that it makes much sense to move forward with the idea.

More replies inline.

On Tue, Feb 9, 2021 at 2:37 AM Wenchen Fan  wrote:

> There’s also a massive performance penalty for the Invoke approach when
> falling back to non-codegen because the function is loaded and invoked each
> time eval is called. It is much cheaper to use a method in an interface.
>
> Can you elaborate? Using the row parameter or individual parameters
> shouldn't change the life cycle of the UDF instance.
>
The eval method looks up the method and invokes it every time using
reflection. That’s quite a bit slower than calling an interface method on
a UDF instance.
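
A rough sketch of the two non-codegen call shapes being compared (names are
illustrative; the real expression classes are more involved):

import java.lang.reflect.Method;
import org.apache.spark.sql.catalyst.InternalRow;

class EvalPaths {
  // Assumed interface from the proposal (sketch).
  interface ScalarFunction<R> {
    R produceResult(InternalRow input);
  }

  // Invoke-style fallback: resolve and reflectively call the method per evaluation.
  static Object evalByReflection(Object udf, Object[] args, Class<?>[] argTypes)
      throws Exception {
    Method m = udf.getClass().getMethod("produceResult", argTypes); // repeated lookup
    return m.invoke(udf, args);                                     // reflective call
  }

  // Interface-style fallback: a plain virtual call on the UDF instance.
  static Object evalByInterface(ScalarFunction<?> udf, InternalRow input) {
    return udf.produceResult(input);
  }
}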

> Should they use String or UTF8String? What representations are supported
> and how will Spark detect and produce those representations?
>
> It's the same as InternalRow. We can just copy-paste the document of
> InternalRow to explain the corresponding java type for each data type.
>
My point is that having a single method signature that uses InternalRow and
is inherited from an interface is much easier for users and Spark. If a
user forgets to use UTF8String and writes a method with String instead,
then the UDF wouldn’t work. What then? Does Spark detect that the wrong
type was used? It would need to or else it would be difficult for a UDF
developer to tell what is wrong. And this is a runtime issue so it is
caught late.
-- 
Ryan Blue


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Ryan Blue
Wenchen,

There are a few issues with the Invoke approach, and I don’t think that it
is really much better for the additional complexity of the API.

First I think that you’re confusing codegen to call a function with codegen
to implement a function. The non-goal refers to supporting codegen to
*implement* a UDF. That’s what could have differences between the called
version and generated version. But the Invoke option isn’t any different in
that case because Invoke codegen is only used to call a method, and we can
codegen int result = udfName(x, y) just like we can codegen int result =
udfName(row).

The Invoke approach also has a problem with expressiveness. Consider a map
function that builds a map from its inputs as key/value pairs: map("x", r *
cos(theta), "y", r * sin(theta)). If this requires a defined Java function,
then there must be lots of implementations for different numbers of pairs,
for different types, etc:

public MapData buildMap(String k1, int v1);
...
public MapData buildMap(String k1, long v1);
...
public MapData buildMap(String k1, float v1);
...
public MapData buildMap(String k1, double v1);
public MapData buildMap(String k1, double v1, String k2, double v2);
public MapData buildMap(String k1, double v1, String k2, double v2,
String k3, double v3);
...

Clearly, this and many other use cases would fall back to varargs instead.
In that case, there is little benefit to using invoke because all of the
arguments will get collected into an Object[] anyway. That’s basically the
same thing as using a row object, just without convenience functions that
return specific types like getString, forcing implementations to cast
instead. And, the Invoke approach has a performance *penalty* when existing
rows could be simply projected using a wrapper.

There’s also a massive performance penalty for the Invoke approach when
falling back to non-codegen because the function is loaded and invoked each
time eval is called. It is much cheaper to use a method in an interface.

Next, the Invoke approach is much more complicated for implementers to use.
Should they use String or UTF8String? What representations are supported
and how will Spark detect and produce those representations? What if a
function uses both String *and* UTF8String? Will Spark detect this for each
parameter? Having one or two functions called by Spark is much easier to
maintain in Spark and avoid a lot of debugging headaches when something
goes wrong.

On Mon, Feb 8, 2021 at 12:00 PM Wenchen Fan  wrote:

This is a very important feature, thanks for working on it!
>
> Spark uses codegen by default, and it's a bit unfortunate to see that
> codegen support is treated as a non-goal. I think it's better to not ask
> the UDF implementations to provide two different code paths for interpreted
> evaluation and codegen evaluation. The Expression API does so and it's very
> painful. Many bugs were found due to inconsistencies between
> the interpreted and codegen code paths.
>
> Now, Spark has the infra to call arbitrary Java functions in
> both interpreted and codegen code paths, see StaticInvoke and Invoke. I
> think we are able to define the UDF API in the most efficient way.
> For example, a UDF that takes long and string, and returns int:
>
> class MyUDF implements ... {
>   int call(long arg1, UTF8String arg2) { ... }
> }
>
> There is no compile-time type-safety. But there is also no boxing, no
> extra InternalRow building, no separated interpreted and codegen code
> paths. The UDF will report input data types and result data type, so the
> analyzer can check if the call method is valid via reflection, and we
> still have query-compile-time type safety. It also simplifies development
> as we can just use the Invoke expression to invoke UDFs.
>
> On Tue, Feb 9, 2021 at 2:52 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I'd like to start a discussion for adding a FunctionCatalog interface to
>> catalog plugins. This will allow catalogs to expose functions to Spark,
>> similar to how the TableCatalog interface allows a catalog to expose
>> tables. The proposal doc is available here:
>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>
>> Here's a high-level summary of some of the main design choices:
>> * Adds the ability to list and load functions, not to create or modify
>> them in an external catalog
>> * Supports scalar, aggregate, and partial aggregate functions
>> * Uses load and bind steps for better error messages and simpler
>> implementations
>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>> data
>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>> and other future features
>>
>> There is also a PR with the proposed API:
>> 

[DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Ryan Blue
Hi everyone,

I'd like to start a discussion for adding a FunctionCatalog interface to
catalog plugins. This will allow catalogs to expose functions to Spark,
similar to how the TableCatalog interface allows a catalog to expose
tables. The proposal doc is available here:
https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit

Here's a high-level summary of some of the main design choices:
* Adds the ability to list and load functions, not to create or modify them
in an external catalog
* Supports scalar, aggregate, and partial aggregate functions
* Uses load and bind steps for better error messages and simpler
implementations
* Like the DSv2 table read and write APIs, it uses InternalRow to pass data
* Can be extended using mix-in interfaces to add vectorization, codegen,
and other future features
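
In code form, the catalog side is roughly the following sketch; the PR below
is the authoritative version and the exact names may differ.

// Sketch of the catalog API summarized above. Identifier and CatalogPlugin are
// the existing DSv2 catalog types; UnboundFunction is part of this proposal.
public interface FunctionCatalog extends CatalogPlugin {
  // List the functions in a namespace; no create/alter/drop in this proposal.
  Identifier[] listFunctions(String[] namespace) throws NoSuchNamespaceException;

  // Load a function by identifier; it is bound to input types before use.
  UnboundFunction loadFunction(Identifier ident) throws NoSuchFunctionException;
}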

There is also a PR with the proposed API:
https://github.com/apache/spark/pull/24559/files

Let's discuss the proposal here rather than on that PR, to get better
visibility. Also, please take the time to read the proposal first. That
really helps clear up misconceptions.



-- 
Ryan Blue


Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-12-01 Thread Ryan Blue
Wenchen, could you start a new thread? Many people have probably already
muted this one, and it isn't really on topic.

The question that needs to be discussed is whether this is a safe change
for the 3.1 release, and reusing an old thread is not a great way to get
people's attention about something potentially harmful like that.

On Tue, Dec 1, 2020 at 10:46 AM Wenchen Fan  wrote:

> I'm reviving this thread because this feature was reverted before the 3.0
> release, and now we are trying to add it back since the CREATE TABLE syntax
> is unified.
>
> The benefits are pretty clear: CREATE TABLE by default (without USING or
> STORED AS) should create native tables that work best with Spark. You can
> see all the benefits listed in https://github.com/apache/spark/pull/30554.
>
> I'm sending this email to collect feedback about the risks. AFAIK
> the broken use cases are:
> 1. A user issues `CREATE TABLE ... LOCATION ...`. After some table
> insertions he wants to read the data files directly from the table location.
> Because the file format is changed from Hive text to Parquet, this use case
> may be broken.
> 2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE`
> or `LOAD DATA`. These two are Hive-specific commands and don't work with
> Spark native tables.
> 3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions
> with different serdes to this table. Spark doesn't allow a native
> partitioned table to have partitions with different formats.
>
> From my personal experience, the Hive text tables are usually used to
> import CSV-like data. It's very likely that people will create Hive text
> tables explicitly as they need the Hive syntax to specify options like
> delimiter. Besides, I'm not sure how many Spark users are using this
> feature, as the native CSV data source can do the same job.
>
> I'd consider it a bad user experience if a simple `CREATE TABLE` gives
> users a very slow table. Changing it to return a native Parquet table doesn't
> seem to break many people, but I could be wrong.
>
> Please reply to this thread if you know more use cases that may be
> affected by this change, and share your thoughts.
>
> Thanks,
> Wenchen
>
> On Sat, Dec 7, 2019 at 1:58 PM Takeshi Yamamuro 
> wrote:
>
>> Oh, looks nice. Thanks for the sharing, Dongjoon
>>
>> Bests,
>> Takeshi
>>
>> On Sat, Dec 7, 2019 at 3:35 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> I want to share the following change to the community.
>>>
>>> SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax
>>>
>>> This is merged today and now Spark's `CREATE TABLE` is using Spark's
>>> default data sources instead of `hive` provider. This is a good and big
>>> improvement for Apache Spark 3.0, but this might surprise someone. (Please
>>> note that there is a fallback option for them.)
>>>
>>> Thank you, Yi, Wenchen, Xiao.
>>>
>>> Cheers,
>>> Dongjoon.
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Seeking committers' help to review on SS PR

2020-11-30 Thread Ryan Blue
Jungtaek,

If there are contributors that you trust for reviews, then please let PMC
members know so they can be considered. I agree that is the best solution.

If there aren't contributors that the PMC wants to add as committers, then
I suggest agreeing on a temporary exception to help make progress in this
area and give contributors more opportunities to develop. Something like
this: for the next 6 months, contributions from committers to SS can be
committed without a committer +1 if they are reviewed by at least one
contributor (and have no dissent from committers, of course). Then after
the period expires, we would ideally have new people ready to be added as
committers.

That would need to be voted on, but I think it is a reasonable step to help
resuscitate Spark streaming.

On Fri, Nov 27, 2020 at 7:15 PM Sean Owen  wrote:

> I don't know the code well, but those look minor and straightforward. They
> have reviews from the two most knowledgeable people in this area. I don't
> think you need to block for 6 months after proactively seeking all likely
> reviewers - I'm saying that's the resolution to this type of situation
> (too).
>
> On Fri, Nov 27, 2020 at 8:55 PM Jungtaek Lim 
> wrote:
>
>> Btw, there are two more PRs which got LGTM by a SS contributor but fail
>> to get attention from committers. They're 6+ months old. Could you help
>> reviewing this as well, or do you all think 6 months of time range + LGTM
>> from an SS contributor is enough to go ahead?
>>
>> https://github.com/apache/spark/pull/27649
>> https://github.com/apache/spark/pull/28363
>>
>> These are under 100 lines of changes per each, and not invasive.
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Seeking committers' help to review on SS PR

2020-11-23 Thread Ryan Blue
I'll go take a look.

While I would generally agree with Sean that it would be appropriate in
this case to commit, I'm very hesitant to set that precedent. I'd prefer to
stick with "review then commit" and, if needed, relax that constraint for
parts of the project that can't get reviewers for a certain period of time.
We did that in another community where there weren't many reviewers and we
wanted to get more people involved, but we put a time limit on it and set
expectations to prevent any perception of abuse. I would support doing that
in SS.

Thanks for being so patient on that PR. I'm sorry that you had to wait so
long.

On Mon, Nov 23, 2020 at 7:11 AM Sean Owen  wrote:

> I don't see any objections on that thread. You're a committer and have
> reviews from other knowledgeable people in this area. Do you have any
> reason to believe it's controversial, like, changes semantics or APIs? Were
> there related discussions elsewhere that expressed any concern?
>
> From a glance, OK it's introducing a new idea of state schema and
> validation; would it conflict with any other possible approaches, have any
> limits if this is enshrined as supported functionality? There's always some
> cost to introducing yet more code to support, but, this doesn't look
> intrusive or large.
>
> The "don't review your own PR" idea isn't hard-and-fast. I don't think
> anyone needs to block for anything like this long if you have other capable
> reviews and you are a committer, if you don't see that it impacts other
> code meaningfully in a way that really demands review from others, and in
> good faith judge that it is worthwhile. I think you are the one de facto
> expert on that code and indeed you can't block yourself for 1.5 years or
> else nothing substantial would happen.
>
>
>
> On Mon, Nov 23, 2020 at 1:18 AM Jungtaek Lim 
> wrote:
>
>> Hi devs,
>>
>> I have been struggling to find reviewers who are committers, to get my PR
>> [1] for SPARK-27237 [2] reviewed. The PR was submitted on Mar. 2019 (1.5
>> years ago), and somehow it got two approvals from contributors working on
>> the SS area, but still doesn't get any committers' traction to review.
>> (I can review others' SS PRs and I'm trying to unblock other SS area
>> contributors, but I can't self review my SS PRs. Not sure it's technically
>> possible, but fully sure it's not encouraged.)
>>
>> Could I please ask help to unblock this before feature freeze for Spark
>> 3.1 is happening? Submitted 1.5 years ago and continues struggling for
>> including it in Spark 3.2 (another half of a year) doesn't make sense to me.
>>
>> In addition, is there a way to unblock me to work for meaningful features
>> instead of being stuck with small improvements? I have something in my
>> backlog but I'd rather not want to continue struggling with new PRs.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://github.com/apache/spark/pull/24173
>> 2. https://issues.apache.org/jira/browse/SPARK-27237
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Spark 3.1 branch cut 4th Dec?

2020-11-20 Thread Ryan Blue
I think we should be able to get the CREATE TABLE changes in. Now that the
main blocker (EXTERNAL) has been decided, it's just a matter of normal
review comments.

On Fri, Nov 20, 2020 at 9:05 AM Dongjoon Hyun 
wrote:

> Thank you for sharing, Xiao.
>
> I hope we are able to make some agreement for CREATE TABLE DDLs, too.
>
> Bests,
> Dongjoon.
>
> On Fri, Nov 20, 2020 at 9:01 AM Xiao Li  wrote:
>
>> https://github.com/apache/spark/pull/28026 is the major feature I am
>> tracking. It is painful to keep two sets of CREATE TABLE DDLs with
>> different behaviors. This hurts the usability of our SQL users, based on
>> what I heard. Unfortunately, this PR missed Spark 3.0 release. Now, I think
>> we should try our best to address it in 3.1.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Fri, Nov 20, 2020 at 8:52 AM Xiao Li wrote:
>>
>>> Hi, Dongjoon,
>>>
>>> Thank you for your feedback. I think *Early December* does not mean we
>>> will cut the branch on Dec 1st. I do not think Dec 1st and Dec 4th are a
>>> big deal. Normally, it would be nice to give enough buffer. Based on my
>>> understanding, this email is just a *proposal* and a *reminder*. In the
>>> past, we often got mixed feedbacks.
>>>
>>> Anyway, we are collecting the feedbacks from the whole community.
>>> Welcome the inputs from everyone else
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> On Fri, Nov 20, 2020 at 8:33 AM Dongjoon Hyun wrote:
>>>
>>>> Hi, Xiao.
>>>>
>>>> I agree.
>>>>
>>>> > Merging the feature work after the branch cut should not be
>>>> encouraged in general, although some committers did make some exceptions
>>>> based on their own judgement. We should try to avoid merging the feature
>>>> work after the branch cut.
>>>>
>>>> So, the Apache Spark community accepted your request for delay already.
>>>> (Early November to Early December)
>>>>
>>>> -
>>>> https://github.com/apache/spark-website/commit/0cd0bdc80503882b4737db7e77cc8f9d17ec12ca
>>>>
>>>> I don't think the branch cut should be delayed again. We don't need to
>>>> have two weeks after Hyukjin's email.
>>>>
>>>> Given the delay, I'd strongly recommend cutting the branch on 1st
>>>> December.
>>>>
>>>> I'll create a `branch-3.1` on 1st December if Hyukjin is too busy to
>>>> start the stabilization.
>>>>
>>>> Again, it will not block you if you have an exceptional request.
>>>>
>>>> However, it would be helpful for all of us if you make it clear what
>>>> features you are waiting for now.
>>>>
>>>> We are creating Apache Spark together.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Nov 19, 2020 at 11:38 PM Xiao Li  wrote:
>>>>
>>>>> Correction:
>>>>>
>>>>> Merging the feature work after the branch cut should not be encouraged
>>>>> in general, although some committers did make some exceptions based on
>>>>> their own judgement. We should try to avoid merging the feature work after
>>>>> the branch cut.
>>>>>
>>>>> This email is a good reminder message. At least, we have two weeks
>>>>> ahead of the proposed branch cut date. I hope each feature owner might
>>>>> hurry up and try to finish it before the branch cut.
>>>>>
>>>>> Xiao
>>>>>
>>>>> Xiao Li  于2020年11月19日周四 下午11:36写道:
>>>>>
>>>>>> We should try to merge the feature work after the branch cut. This
>>>>>> should not be encouraged in general, although some committers did make
>>>>>> some exceptions based on their own judgement.
>>>>>>
>>>>>> This email is a good reminder message. At least, we have two weeks
>>>>>> ahead of the proposed branch cut date. I hope each feature owner might
>>>>>> hurry up and try to finish it before the branch cut.
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>> Dongjoon Hyun  于2020年11月19日周四 下午4:02写道:
>>>>>>
>>>>>>> Thank you for your volunteering!
>>>>>>>
>>>>>>> Since the previous branch cuts were always soft code freezes, which
>>>>>>> still allowed committers to merge to the new branches for a while, I
>>>>>>> believe 1st December will be better for stabilization.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 19, 2020 at 3:50 PM Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I think we haven't decided on the exact branch cut, code freeze, and
>>>>>>>> release manager yet.
>>>>>>>>
>>>>>>>> As we planned in https://spark.apache.org/versioning-policy.html
>>>>>>>>
>>>>>>>> Early Dec 2020 Code freeze. Release branch cut
>>>>>>>>
>>>>>>>> Code freeze and branch cutting is coming.
>>>>>>>>
>>>>>>>> Therefore, we should finish any remaining work for Spark 3.1, and
>>>>>>>> switch to QA mode soon.
>>>>>>>> I think it's time to set the schedule to keep it on track, and I would
>>>>>>>> like to volunteer to help drive this process.
>>>>>>>>
>>>>>>>> I am currently thinking 4th Dec as the branch-cut date.
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>> Thanks all.
>>>>>>>>
>>>>>>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: SPIP: Catalog API for view metadata

2020-11-10 Thread Ryan Blue
An extra RPC call is a concern for the catalog implementation. It is simple
to cache the result of a call to avoid a second one if the catalog chooses.

I don't think that an extra RPC that can be easily avoided is a reasonable
justification to add caches in Spark. For one thing, it doesn't solve the
problem because the proposed API still requires separate lookups for tables
and views.

The only solution that would help is to use a combined trait, but that has
issues. For one, view substitution is much cleaner when it happens well
before table resolution. And, View and Table are very different objects;
returning Object from this API doesn't make much sense.

One extra RPC is not unreasonable, and the choice should be left to
sources. That's the easiest place to cache results from the underlying
store.
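
For illustration only (this sketch is not part of the SPIP, and the class
name and the loadFromStore method here are made up): a source that wants to
avoid a second metastore call can keep a small cache in front of its own
client, e.g.

  import java.util.concurrent.ConcurrentHashMap
  import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

  // Sketch of a TableCatalog that caches loadTable results so repeated
  // lookups of the same identifier only hit the underlying store once.
  abstract class CachingCatalog extends TableCatalog {
    private val cache = new ConcurrentHashMap[Identifier, Table]()

    // The real metastore call, implemented by the concrete catalog.
    protected def loadFromStore(ident: Identifier): Table

    override def loadTable(ident: Identifier): Table =
      cache.computeIfAbsent(ident, (i: Identifier) => loadFromStore(i))

    override def invalidateTable(ident: Identifier): Unit =
      cache.remove(ident)
  }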

On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan  wrote:

> Moving back the discussion to this thread. The current argument is how to
> avoid extra RPC calls for catalogs supporting both table and view. There
> are several options:
> 1. ignore it, as extra RPC calls are cheap compared to the query execution
> 2. have a per-session cache for loaded tables/views
> 3. have a per-query cache for loaded tables/views
> 4. add a new trait TableViewCatalog
>
> I think it's important to avoid perf regression with new APIs. RPC calls
> can be significant for short queries. We may also double the RPC
> traffic which is bad for the metastore service. Normally I would not
> recommend caching as cache invalidation is a hard problem. Personally I
> prefer option 4 as it only affects catalogs that support both table and
> view, and it fits the hive catalog very well.
>
> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge  wrote:
>
>> SPIP
>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
>> has been updated. Please review.
>>
>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge  wrote:
>>
>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>
>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan  wrote:
>>>
>>>> Any updates here? I agree that a new View API is better, but we need a
>>>> solution to avoid performance regression. We need to elaborate on the cache
>>>> idea.
>>>>
>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue  wrote:
>>>>
>>>>> I think it is a good idea to keep tables and views separate.
>>>>>
>>>>> The main two arguments I’ve heard for combining lookup into a single
>>>>> function are the ones brought up in this thread. First, an identifier in a
>>>>> catalog must be either a view or a table and should not collide. Second, a
>>>>> single lookup is more likely to require a single RPC. I think the RPC
>>>>> concern is well addressed by caching, which we already do in the Spark
>>>>> catalog, so I’ll primarily focus on the first.
>>>>>
>>>>> Table/view name collision is unlikely to be a problem. Metastores that
>>>>> support both today store them in a single namespace, so this is not a
>>>>> concern for even a naive implementation that talks to the Hive MetaStore.
>>>>> I know that a new metastore catalog could choose to implement both
>>>>> ViewCatalog and TableCatalog and store the two sets separately, but that
>>>>> would be a very strange choice: if the metastore itself has different
>>>>> namespaces for tables and views, then it makes much more sense to expose
>>>>> them through separate catalogs because Spark will always prefer one over
>>>>> the other.
>>>>>
>>>>> In a similar line of reasoning, catalogs that expose both views and
>>>>> tables are much more rare than catalogs that only expose one. For example,
>>>>> v2 catalogs for JDBC and Cassandra expose data through the Table interface
>>>>> and implementing ViewCatalog would make little sense. Exposing new data
>>>>> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are
>>>>> likely to be the same. Say I have a way to convert Pig statements or some
>>>>> other representation into a SQL view. It would make little sense to
>>>>> combine that with some other TableCatalog.
>>>>>
>>>>> I also don’t think there is benefit from an API perspective to justify
>>>>> combining the Table and View interfaces. The two share only schema and
>>>>> properties, and are handled very differently internally — a Vie

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Ryan Blue
+1, I agree with Tom.

On Tue, Nov 10, 2020 at 3:00 PM Dongjoon Hyun 
wrote:

> +1 for Apache Spark 3.1.0.
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 10, 2020 at 6:17 AM Tom Graves 
> wrote:
>
>> +1 since its a correctness issue, I think its ok to change the behavior
>> to make sure the user is aware of it and let them decide.
>>
>> Tom
>>
>> On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh <
>> vii...@gmail.com> wrote:
>>
>>
>> Hi devs,
>>
>> In Spark Structured Streaming, chained stateful operators can possibly
>> produce incorrect results under the global watermark. SPARK-33259
>> (https://issues.apache.org/jira/browse/SPARK-33259) has an example
>> demonstrating what the correctness issue could be.
>>
>> Currently we don't prevent users from running such queries, because the
>> possible correctness issue with chained stateful operators in a streaming
>> query is not straightforward for users. From a user's perspective, it will
>> possibly be considered a Spark bug like SPARK-33259. In the worse case,
>> users are not even aware of the correctness issue and consume wrong results.
>>
>> IMO, it is better to disable such queries and let users choose to run the
>> query if they understand there is such a risk, instead of implicitly
>> running the query and letting users find out the correctness issue by
>> themselves.
>>
>> I would like to propose to disable the streaming query with possible
>> correctness issue in chained stateful operators. The behavior can be
>> controlled by a SQL config, so if users understand the risk and still want
>> to run the query, they can disable the check.
>>
>> In the PR (https://github.com/apache/spark/pull/30210), the concern raised
>> so far is that this changes the current behavior and by default it will
>> break some existing streaming queries. But I think it is pretty easy to
>> disable the check with the new config. In the PR there is currently no
>> objection, but a suggestion to hear more voices. Please let me know if you
>> have some thoughts.
>>
>> Thanks.
>> Liang-Chi Hsieh
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-23 Thread Ryan Blue
I agree. If the user configures an invalid catalog, it should fail and
propagate the exception. Running with a catalog other than the one the user
requested is incorrect.

On Fri, Oct 23, 2020 at 5:24 AM Russell Spitzer 
wrote:

> I was convinced that we should probably just fail, but if that is too much
> of a change, then logging the exception is also acceptable.
>
> On Thu, Oct 22, 2020, 10:32 PM Jungtaek Lim 
> wrote:
>
>> Hi devs,
>>
>> I got another report regarding configuring the v2 session catalog, where
>> Spark fails to instantiate the configured catalog. For now, it simply logs
>> an error message without the exception information, and silently falls back
>> to the default session catalog.
>>
>>
>> https://github.com/apache/spark/blob/3819d39607392aa968595e3d97b84fedf83d08d9/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala#L75-L95
>>
>> IMO, as the user intentionally provides the session catalog, it shouldn't
>> fall back and should just throw the exception. Otherwise (if we still want
>> to do the fallback), we need to at least add the exception information to
>> the error log message.
>>
>> Would like to hear the voices.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
I don’t think Spark ever claims to be 100% Hive compatible.

I just found some relevant documentation
<https://docs.databricks.com/spark/latest/spark-sql/compatibility/hive.html#apache-hive-compatibility>
on this, where Databricks claims that “Apache Spark SQL in Databricks is
designed to be compatible with the Apache Hive, including metastore
connectivity, SerDes, and UDFs.”

There is a strong expectation that Spark is compatible with Hive, and the
Spark community has made the claim. That isn’t a claim of 100%
compatibility, but no one suggested that there was 100% compatibility.

On Wed, Oct 7, 2020 at 11:54 AM Ryan Blue  wrote:

> I don’t think Spark ever claims to be 100% Hive compatible.
>
> By accepting the EXTERNAL keyword in some circumstances, Spark is
> providing compatibility with Hive DDL. Yes, there are places where it
> breaks. The question is whether we should deliberately break what a Hive
> catalog could implement, when we know what Hive’s behavior is.
>
> CREATE EXTERNAL TABLE is not a Hive-specific feature
>
> Great. So there are other catalogs that could use it. Why should Spark
> choose to limit Hive’s interpretation of this keyword?
>
> While it is great that we seem to agree that Spark shouldn’t do this — now
> that Nessie was pointed out — I’m concerned that you still seem to think
> this is a choice that Spark could reasonably make. *Spark cannot
> arbitrarily choose how to interpret DDL for an external catalog*.
>
> You may not consider this arbitrary because there are other examples where
> location is required. But the Hive community made the choice that these
> clauses are orthogonal, so it is clearly a choice of the external system,
> and it is not Spark’s role to dictate how an external database should
> behave.
>
> I think the Nessie use case is good enough to justify the decoupling of
> EXTERNAL and LOCATION.
>
> It appears that we have consensus. This will be passed to catalogs, which
> can implement the behavior that they choose.
>
> On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan  wrote:
>
>> > I have some hive queries that I want to run on Spark.
>>
>> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
>> LOCATION can't help you too much here. If you do have this use case, we
>> need a much wider discussion about how to achieve it.
>>
>> For this particular topic, we need concrete use cases like Nessie
>> <https://projectnessie.org/tools/hive/>. It will be great to see more
>> concrete use cases, but I think the Nessie use case is good enough to
>> justify the decoupling of EXTERNAL and LOCATION.
>>
>> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
>> have it. That's why I think Hive-compatibility alone is not a reasonable
>> use case. For your reference:
>> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> <https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
>> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> <https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
>> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or
>> FILE_NAME option: doc
>> <https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
>> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
>> <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>
>>
>> > with which Spark claims to be compatible
>>
>> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
>> diverged from Hive intentionally in several places, where we think the Hive
>> behavior was not reasonable and we shouldn't follow it.
>>
>> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue  wrote:
>>
>>> how about LOCATION without EXTERNAL? Currently Spark treats it as an
>>> external table.
>>>
>>> I think there is some confusion about what Spark has to handle.
>>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>>> MetaStore that Spark connects to, and the general expectation is that Spark
>>> doesn’t change the meaning of table configuration. There are notable bugs
>>> where Spark has different behavior, but that is the expectation.
>>>
>>> In this particular case, we’re talking about what can be expressed in
>>> DDL that is sent to an external catalog. Spark could (unwisely) choose to
>>> disallow some DDL combinations, but the table 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
I don’t think Spark ever claims to be 100% Hive compatible.

By accepting the EXTERNAL keyword in some circumstances, Spark is providing
compatibility with Hive DDL. Yes, there are places where it breaks. The
question is whether we should deliberately break what a Hive catalog could
implement, when we know what Hive’s behavior is.

CREATE EXTERNAL TABLE is not a Hive-specific feature

Great. So there are other catalogs that could use it. Why should Spark
choose to limit Hive’s interpretation of this keyword?

While it is great that we seem to agree that Spark shouldn’t do this — now
that Nessie was pointed out — I’m concerned that you still seem to think
this is a choice that Spark could reasonably make. *Spark cannot
arbitrarily choose how to interpret DDL for an external catalog*.

You may not consider this arbitrary because there are other examples where
location is required. But the Hive community made the choice that these
clauses are orthogonal, so it is clearly a choice of the external system,
and it is not Spark’s role to dictate how an external database should
behave.

I think the Nessie use case is good enough to justify the decoupling of
EXTERNAL and LOCATION.

It appears that we have consensus. This will be passed to catalogs, which
can implement the behavior that they choose.

On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan  wrote:

> > I have some hive queries that I want to run on Spark.
>
> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
> LOCATION can't help you too much here. If you do have this use case, we
> need a much wider discussion about how to achieve it.
>
> For this particular topic, we need concrete use cases like Nessie
> <https://projectnessie.org/tools/hive/>. It will be great to see more
> concrete use cases, but I think the Nessie use case is good enough to
> justify the decoupling of EXTERNAL and LOCATION.
>
> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
> have it. That's why I think Hive-compatibility alone is not a reasonable
> use case. For your reference:
> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> <https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> <https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME
> option: doc
> <https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
> <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>
>
> > with which Spark claims to be compatible
>
> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
> diverged from Hive intentionally in several places, where we think the Hive
> behavior was not reasonable and we shouldn't follow it.
>
> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue  wrote:
>
>> how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> external table.
>>
>> I think there is some confusion about what Spark has to handle.
>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>> MetaStore that Spark connects to, and the general expectation is that Spark
>> doesn’t change the meaning of table configuration. There are notable bugs
>> where Spark has different behavior, but that is the expectation.
>>
>> In this particular case, we’re talking about what can be expressed in DDL
>> that is sent to an external catalog. Spark could (unwisely) choose to
>> disallow some DDL combinations, but the table is implemented through a
>> plugin so the interpretation is up to the plugin. Spark has no role in
>> choosing how to treat this table, unless it is loaded through Spark’s
>> built-in catalog; in which case, see above.
>>
>> I don’t think Hive compatibility itself is a “use case”.
>>
>> Why?
>>
>> Hive is an external database that defines its own behavior and with which
>> Spark claims to be compatible. If Hive isn’t a valid use case, then why is
>> EXTERNAL supported at all?
>>
>> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau 
>> wrote:
>>
>>>
>>>
>>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan  wrote:
>>>
>>>> I don't think Hive compatibility itself is a "use case".
>>>>
>>> Ok let's add on top of this: I have some hive queries that I want to run
>>> on Spark. I believe that makes

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
how about LOCATION without EXTERNAL? Currently Spark treats it as an
external table.

I think there is some confusion about what Spark has to handle. Regardless
of what Spark allows as DDL, these tables can exist in a Hive MetaStore
that Spark connects to, and the general expectation is that Spark doesn’t
change the meaning of table configuration. There are notable bugs where
Spark has different behavior, but that is the expectation.

In this particular case, we’re talking about what can be expressed in DDL
that is sent to an external catalog. Spark could (unwisely) choose to
disallow some DDL combinations, but the table is implemented through a
plugin so the interpretation is up to the plugin. Spark has no role in
choosing how to treat this table, unless it is loaded through Spark’s
built-in catalog; in which case, see above.

I don’t think Hive compatibility itself is a “use case”.

Why?

Hive is an external database that defines its own behavior and with which
Spark claims to be compatible. If Hive isn’t a valid use case, then why is
EXTERNAL supported at all?

On Wed, Oct 7, 2020 at 10:17 AM Holden Karau  wrote:

>
>
> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan  wrote:
>
>> I don't think Hive compatibility itself is a "use case".
>>
> Ok let's add on top of this: I have some hive queries that I want to run
> on Spark. I believe that makes it a use case.
>
>> The Nessie <https://projectnessie.org/tools/hive/> example you mentioned
>> is a reasonable use case to me: some frameworks/applications want to create
>> external tables without user-specified location, so that they can manage
>> the table directory themselves and implement fancy features.
>>
>> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
>> should clearly document that, EXTERNAL and LOCATION are only applicable for
>> file-based data sources, and catalog implementation should fail if the
>> table has EXTERNAL or LOCATION property, but the table provider is not
>> file-based.
>>
>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> external table. Hive gives warning when you create managed tables with
>> custom location, which means this behavior is not recommended. Shall we
>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>
>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue 
>> wrote:
>>
>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>
>>> The keyword came from Hive and we all agree that a Hive catalog with
>>> Hive behavior can’t be implemented if Spark chooses to couple this with
>>> LOCATION. Why is this use case not a justification?
>>>
>>> Also, the option to keep behavior the same as before is not mutually
>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
>>> the same behavior in its catalog. But Spark cannot just choose to break
>>> compatibility with external systems by deciding when to fail certain
>>> combinations of DDL options. Choosing not to allow external without
>>> location when it is valid for Hive prevents building a compatible catalog.
>>>
>>> There are many reasons to build a Hive-compatible catalog. A great
>>> recent example is Nessie <https://projectnessie.org/tools/hive/>, which
>>> enables branching and tagging table states across several table formats and
>>> aims to be compatible with Hive.
>>>
>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan  wrote:
>>>
>>>> > As someone who's had the job of porting different SQL dialects to
>>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>>
>>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options
>>>> we are discussing are:
>>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist
>>>> with LOCATION (or path option).
>>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>>
>>>> I'm fine with option 2 if there are reasonable use cases. I think it's
>>>> always safer to keep the behavior the same as before. If we want to change
>>>> the behavior and follow option 2, we need use cases to justify it.
>>>>
>>>> For now, the only use case I see is for Hive compatibility and allow
>>>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>>>> cases we are targeting?
>>>>
>>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau 
>>>> wrote:
>>>>
>>>>> As someone who's had the job of porting di

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Ryan Blue
I disagree that this is “by design”. An operation like DROP TABLE should
use a v2 drop plan if the table is v2.

If a v2 table is loaded or created using a v2 catalog it should also be
dropped that way. Otherwise, the v2 catalog is not notified when the table
is dropped and can’t perform other necessary updates, like invalidating
caches or dropping state outside of Hive. V2 tables should always use the
v2 API, and I’m not aware of a design where that wasn’t the case.

I’d also say that for DROP TABLE in particular, all calls could use the v2
catalog. We may not want to do this until we are confident as Wenchen said,
but this would be the simpler solution. The v2 catalog can delegate to the
old session catalog, after all.
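
For reference, a rough sketch of the two configurations being discussed in
this thread (the com.example class names are placeholders, not real
implementations):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    // Register a separate catalog named "mycat" and make it the default
    // (SQLConf.DEFAULT_CATALOG), so unqualified names resolve against it.
    .config("spark.sql.catalog.mycat", "com.example.MyCatalog")
    .config("spark.sql.defaultCatalog", "mycat")
    // Or: extend the built-in session catalog in place
    // (SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).
    .config("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalogExtension")
    .getOrCreate()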

On Wed, Oct 7, 2020 at 3:48 AM Wenchen Fan  wrote:

> If you just want to save typing the catalog name when writing table names,
> you can set your custom catalog as the default catalog (See
> SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is
> used to extend the v1 session catalog, not replace it.
>
> On Wed, Oct 7, 2020 at 5:36 PM Jungtaek Lim 
> wrote:
>
>> If it's by design but not ready yet, then IMHO replacing the default
>> session catalog should be restricted until things are sorted out, as it
>> causes a lot of confusion and has known bugs. Actually there's another
>> bug/limitation in the default session catalog regarding the length of
>> identifiers, so things that work with a custom catalog no longer work when
>> it replaces the default session catalog.
>>
>> On Wed, Oct 7, 2020 at 6:05 PM Wenchen Fan  wrote:
>>
>>> Ah, this is by design. V1 tables should still go through the v1 session
>>> catalog. I think we can remove this restriction when we are confident about
>>> the new v2 DDL commands that work with v2 catalog APIs.
>>>
>>> On Wed, Oct 7, 2020 at 5:00 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> My case is DROP TABLE and DROP TABLE supports both v1 and v2 (as it
>>>> simply works when I use custom catalog without replacing the default
>>>> catalog).
>>>>
>>>> It just fails on v2 when the "default catalog" is replaced (say I
>>>> replace 'spark_catalog'), because TempViewOrV1Table is providing value even
>>>> with v2 table, and then the catalyst goes with v1 exec. I guess all
>>>> commands leveraging TempViewOrV1Table to determine whether the table is v1
>>>> vs v2 would all suffer from this issue.
>>>>
>>>> On Wed, Oct 7, 2020 at 5:45 PM Wenchen Fan  wrote:
>>>>
>>>>> Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE
>>>>> LIKE), so it's possible that some commands still go through the v1 session
>>>>> catalog although you configured a custom v2 session catalog.
>>>>>
>>>>> Can you create JIRA tickets if you hit any DDL commands that don't
>>>>> support v2 catalog? We should fix them.
>>>>>
>>>>> On Wed, Oct 7, 2020 at 9:15 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> The logical plan for the parsed statement is getting converted either
>>>>>> for old one or v2, and for the former one it keeps using an external
>>>>>> catalog (Hive) - so replacing default session catalog with custom one and
>>>>>> trying to use it like it is in external catalog doesn't work, which
>>>>>> destroys the purpose of replacing the default session catalog.
>>>>>>
>>>>>> Btw I see one approach: in TempViewOrV1Table, if it matches
>>>>>> with SessionCatalogAndIdentifier where the catalog is TableCatalog, call
>>>>>> loadTable in catalog and see whether it's V1 table or not. Not sure it's a
>>>>>> viable approach though, as it requires loading a table during resolution of
>>>>>> the table identifier.
>>>>>>
>>>>>> On Wed, Oct 7, 2020 at 10:04 AM Ryan Blue  wrote:
>>>>>>
>>>>>>> I've hit this with `DROP TABLE` commands that should be passed to a
>>>>>>> registered v2 session catalog, but are handled by v1. I think that's the
>>>>>>> only case we hit in our downstream test suites, but we haven't been
>>>>>>> exploring the use of a session catalog for fallback. We use v2 for
>>>>>>> everything now, which avoids the problem and comes with multi-catalog
>>>>&

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
Wenchen, why are you ignoring Hive as a “reasonable use case”?

The keyword came from Hive and we all agree that a Hive catalog with Hive
behavior can’t be implemented if Spark chooses to couple this with LOCATION.
Why is this use case not a justification?

Also, the option to keep behavior the same as before is not mutually
exclusive with passing EXTERNAL to catalogs. Spark can continue to have the
same behavior in its catalog. But Spark cannot just choose to break
compatibility with external systems by deciding when to fail certain
combinations of DDL options. Choosing not to allow external without
location when it is valid for Hive prevents building a compatible catalog.

There are many reasons to build a Hive-compatible catalog. A great recent
example is Nessie <https://projectnessie.org/tools/hive/>, which enables
branching and tagging table states across several table formats and aims to
be compatible with Hive.

On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan  wrote:

> > As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL
>
> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options we
> are discussing are:
> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist with
> LOCATION (or path option).
> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>
> I'm fine with option 2 if there are reasonable use cases. I think it's
> always safer to keep the behavior the same as before. If we want to change
> the behavior and follow option 2, we need use cases to justify it.
>
> For now, the only use case I see is for Hive compatibility and allow
> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
> cases we are targeting?
>
> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau  wrote:
>
>> As someone who's had the job of porting different SQL dialects to Spark,
>> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>> suggestion of leaving it up to the catalogs on how to handle this makes
>> sense.
>>
>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue 
>> wrote:
>>
>>> I would summarize both the problem and the current state differently.
>>>
>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
>>> compatibility with Hive SQL* because all combinations of EXTERNAL and
>>> LOCATION are valid in Hive, but creating an external table with a
>>> default location is not allowed by Spark. Note that Spark must still handle
>>> these tables because it shares a metastore with Hive, which can still
>>> create them.
>>>
>>> Now catalogs can be plugged in, the question is whether to pass the fact
>>> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>>> handling a create command, or to suppress it and apply Spark’s rule that
>>> LOCATION must be present.
>>>
>>> If it is not passed to the catalog, then a Hive catalog cannot implement
>>> the behavior of Hive SQL, even though Spark added the keyword for Hive
>>> compatibility. The Spark catalog can interpret EXTERNAL however Spark
>>> chooses to, but I think it is a poor choice to force different behavior on
>>> other catalogs.
>>>
>>> Wenchen has also argued that the purpose of this is to standardize
>>> behavior across catalogs. But hiding EXTERNAL would not accomplish that
>>> goal. Whether to physically delete data is a choice that is up to the
>>> catalog. Some catalogs have no “external” concept and will always drop data
>>> when a table is dropped. The ability to keep underlying data files is
>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>> the LOCATION clause, or something else is still up to the catalog
>>> implementation.
>>>
>>> I don’t think that there is a good reason to force catalogs to break
>>> compatibility with Hive SQL, while making it appear as though DDL is
>>> compatible. Because removing EXTERNAL would be a breaking change to the
>>> SQL parser, I think the best option is to pass it to v2 catalogs so the
>>> catalog can decide how to handle it.
>>>
>>> rb
>>>
>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan  wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to start a discussion thread about this topic, as it blocks an
>>>> important feature that we target for Spark 3.1: unify the CREATE TABLE 

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Ryan Blue
I've hit this with `DROP TABLE` commands that should be passed to a
registered v2 session catalog, but are handled by v1. I think that's the
only case we hit in our downstream test suites, but we haven't been
exploring the use of a session catalog for fallback. We use v2 for
everything now, which avoids the problem and comes with multi-catalog
support.

On Tue, Oct 6, 2020 at 5:55 PM Jungtaek Lim 
wrote:

> Hi devs,
>
> I'm not sure whether it's addressed in Spark 3.1, but at least from Spark
> 3.0.1, many SQL DDL statements don't seem to go through the custom catalog
> when I replace the default catalog with a custom catalog and only provide
> 'dbName.tableName' as the table identifier.
>
> I'm not an expert in this area, but after skimming the code I feel
> TempViewOrV1Table looks to be broken for the case, as it can still be a V2
> table. Classifying the table identifier to either V2 table or "temp view or
> v1 table" looks to be mandatory, as former and latter have different code
> paths and different catalog interfaces.
>
> That sounds to me like being stuck, and the only "clear" approach seems to
> be disallowing replacing the default catalog with a custom one. Am I
> missing something?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Official support of CREATE EXTERNAL TABLE

2020-10-06 Thread Ryan Blue
I would summarize both the problem and the current state differently.

Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
SQL, but Spark’s built-in catalog doesn’t allow creating a table with
EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
compatibility with Hive SQL* because all combinations of EXTERNAL and
LOCATION are valid in Hive, but creating an external table with a default
location is not allowed by Spark. Note that Spark must still handle these
tables because it shares a metastore with Hive, which can still create them.
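
To spell out the combinations in question (run in spark-shell with Hive
support, against the built-in catalog; table names and paths are just
placeholders):

  spark.sql("CREATE EXTERNAL TABLE t1 (id INT) USING parquet")
  //   rejected by the parser: EXTERNAL can't be used with the native syntax
  spark.sql("CREATE EXTERNAL TABLE t2 (id INT) STORED AS parquet")
  //   rejected by Spark because there is no LOCATION, but valid DDL in Hive
  spark.sql("CREATE EXTERNAL TABLE t3 (id INT) STORED AS parquet LOCATION '/tmp/t3'")
  //   allowed: an external table at an explicit location
  spark.sql("CREATE TABLE t4 (id INT) STORED AS parquet LOCATION '/tmp/t4'")
  //   allowed, and currently treated as an external table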

Now catalogs can be plugged in, the question is whether to pass the fact
that EXTERNAL was in the CREATE TABLE statement to the v2 catalog handling
a create command, or to suppress it and apply Spark’s rule that LOCATION
must be present.

If it is not passed to the catalog, then a Hive catalog cannot implement
the behavior of Hive SQL, even though Spark added the keyword for Hive
compatibility. The Spark catalog can interpret EXTERNAL however Spark
chooses to, but I think it is a poor choice to force different behavior on
other catalogs.

Wenchen has also argued that the purpose of this is to standardize behavior
across catalogs. But hiding EXTERNAL would not accomplish that goal.
Whether to physically delete data is a choice that is up to the catalog.
Some catalogs have no “external” concept and will always drop data when a
table is dropped. The ability to keep underlying data files is specific to
a few catalogs, and whether that is controlled by EXTERNAL, the LOCATION
clause, or something else is still up to the catalog implementation.

I don’t think that there is a good reason to force catalogs to break
compatibility with Hive SQL, while making it appear as though DDL is
compatible. Because removing EXTERNAL would be a breaking change to the SQL
parser, I think the best option is to pass it to v2 catalogs so the catalog
can decide how to handle it.

rb

On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan  wrote:

> Hi all,
>
> I'd like to start a discussion thread about this topic, as it blocks an
> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
> syntax.
>
> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
> feature in Spark for Hive compatibility.
>
> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
> ... USING parquet`, the parser fails and tells you that EXTERNAL can't be
> specified.
>
> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
> clause or path option. This is not 100% Hive compatible.
>
> As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal
> with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was,
> or we can officially support it.
>
> Please let us know your thoughts:
> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
> you used it in production before? For what use cases?
> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
> It seems to me that it only makes sense for file source, as the table
> directory can be managed. I'm not sure how to interpret EXTERNAL in
> catalogs like jdbc, cassandra, etc.
>
> For more details, please refer to the long discussion in
> https://github.com/apache/spark/pull/28026
>
> Thanks,
> Wenchen
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Ryan Blue
Before, the input was a byte array so we could read from it directly. Now,
the input is a `ByteBufferInputStream` so that Parquet can choose how to
allocate buffers. For example, we use vectored reads from S3 that pull back
multiple buffers in parallel.

Now that the input is a stream based on possibly multiple byte buffers, it
provides a method to get a buffer of a certain length. In most cases, that
will create a ByteBuffer with the same backing byte array, but it may need
to copy if the request spans multiple buffers in the stream. Most of the
time, the call to `slice` only requires duplicating the buffer and setting
its limit, but a read that spans multiple buffers is expensive. It would be
helpful to know whether the time spent is copying data, which would
indicate the backing buffers are too small, or whether it is spent
duplicating the backing byte buffer.
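
Roughly speaking, the two paths look like this (an illustration only, not
Parquet's actual ByteBufferInputStream code):

  import java.nio.ByteBuffer

  // Common case: the requested bytes are all in the current backing buffer,
  // so the slice is a cheap zero-copy view (duplicate plus position/limit).
  def cheapSlice(current: ByteBuffer, len: Int): ByteBuffer = {
    val view = current.duplicate()
    view.limit(view.position() + len)
    current.position(current.position() + len)  // advance past the sliced bytes
    view.slice()
  }

  // Rare case: the request spans two backing buffers, so the bytes must be
  // copied into a freshly allocated buffer, which is the expensive path.
  def copySlice(first: ByteBuffer, second: ByteBuffer, len: Int): ByteBuffer = {
    val out = ByteBuffer.allocate(len)
    while (out.hasRemaining && first.hasRemaining) out.put(first.get())
    while (out.hasRemaining) out.put(second.get())
    out.flip()
    out
  }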

On Mon, Sep 14, 2020 at 5:29 AM Sean Owen  wrote:

> Ryan do you happen to have any opinion there? that particular section
> was introduced in the Parquet 1.10 update:
>
> https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20
> It looks like it didn't use to make a ByteBuffer each time, but read from
> in.
>
> On Sun, Sep 13, 2020 at 10:48 PM Chang Chen  wrote:
> >
> > I think we can copy all encoded data into a ByteBuffer once, and unpack
> values in the loop
> >
> >  while (valueIndex < this.currentCount) {
> > // values are bit packed 8 at a time, so reading bitWidth will
> always work
> > this.packer.unpack8Values(buffer, buffer.position() + valueIndex,
> this.currentBuffer, valueIndex);
> > valueIndex += 8;
> >   }
> >
> > Sean Owen  于2020年9月14日周一 上午10:40写道:
> >>
> >> It certainly can't be called once - it's reading different data each
> time.
> >> There might be a faster way to do it, I don't know. Do you have ideas?
> >>
> >> On Sun, Sep 13, 2020 at 9:25 PM Chang Chen 
> wrote:
> >> >
> >> > Hi experts
> >> >
> >> > it looks like there is a hot spot in
> VectorizedRleValuesReader#readNextGroup()
> >> >
> >> > case PACKED:
> >> >   int numGroups = header >>> 1;
> >> >   this.currentCount = numGroups * 8;
> >> >
> >> >   if (this.currentBuffer.length < this.currentCount) {
> >> > this.currentBuffer = new int[this.currentCount];
> >> >   }
> >> >   currentBufferIdx = 0;
> >> >   int valueIndex = 0;
> >> >   while (valueIndex < this.currentCount) {
> >> > // values are bit packed 8 at a time, so reading bitWidth will
> always work
> >> > ByteBuffer buffer = in.slice(bitWidth);
> >> > this.packer.unpack8Values(buffer, buffer.position(),
> this.currentBuffer, valueIndex);
> >> > valueIndex += 8;
> >> >   }
> >> >
> >> >
> >> > Per my profiling, the code spends 30% of the time in readNextGroup() on
> >> > slice; why can't we call slice outside of the loop?
>


-- 
Ryan Blue


Re: SPIP: Catalog API for view metadata

2020-08-19 Thread Ryan Blue
I think it is a good idea to keep tables and views separate.

The main two arguments I’ve heard for combining lookup into a single
function are the ones brought up in this thread. First, an identifier in a
catalog must be either a view or a table and should not collide. Second, a
single lookup is more likely to require a single RPC. I think the RPC
concern is well addressed by caching, which we already do in the Spark
catalog, so I’ll primarily focus on the first.

Table/view name collision is unlikely to be a problem. Metastores that
support both today store them in a single namespace, so this is not a
concern for even a naive implementation that talks to the Hive MetaStore. I
know that a new metastore catalog could choose to implement both
ViewCatalog and TableCatalog and store the two sets separately, but that
would be a very strange choice: if the metastore itself has different
namespaces for tables and views, then it makes much more sense to expose
them through separate catalogs because Spark will always prefer one over
the other.

In a similar line of reasoning, catalogs that expose both views and tables
are much more rare than catalogs that only expose one. For example, v2
catalogs for JDBC and Cassandra expose data through the Table interface and
implementing ViewCatalog would make little sense. Exposing new data sources
to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely
to be the same. Say I have a way to convert Pig statements or some other
representation into a SQL view. It would make little sense to combine that
with some other TableCatalog.

I also don’t think there is benefit from an API perspective to justify
combining the Table and View interfaces. The two share only schema and
properties, and are handled very differently internally — a View’s SQL
query is parsed and substituted into the plan, while a Table is wrapped in
a relation that eventually becomes a Scan node using SupportsRead. A view’s
SQL also needs additional context to be resolved correctly: the current
catalog and namespace from the time the view was created.
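
To make that concrete, here is a rough sketch of the metadata a separate
View interface would carry, based only on the points above (illustrative,
not the exact API proposed in the SPIP):

  import org.apache.spark.sql.types.StructType

  trait View {
    def name: String
    def sql: String                       // the view's SQL text
    def currentCatalog: String            // catalog at creation time, for resolution
    def currentNamespace: Array[String]   // namespace at creation time, for resolution
    def schema: StructType                // shared with Table
    def properties: java.util.Map[String, String]  // shared with Table
  }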

Query planning is distinct between tables and views, so Spark doesn’t
benefit from combining them. I think it has actually caused problems that
both were resolved by the same method in v1: the resolution rule grew
extremely complicated trying to look up a reference just once because it
had to parse a view plan and resolve relations within it using the view’s
context (current database). In contrast, John’s new view substitution rules
are cleaner and can stay within the substitution batch.

People implementing views would also not benefit from combining the two
interfaces:

   - There is little overlap between View and Table, only schema and
   properties
   - Most catalogs won’t implement both interfaces, so returning a
   ViewOrTable is more difficult for implementations
   - TableCatalog assumes that ViewCatalog will be added separately like
   John proposes, so we would have to break or replace that API

I understand the initial appeal of combining TableCatalog and ViewCatalog
since it is done that way in the existing interfaces. But I think that Hive
chose to do that mostly on the fact that the two were already stored
together, and not because it made sense for users of the API, or any other
implementer of the API.

rb

On Tue, Aug 18, 2020 at 9:46 AM John Zhuge  wrote:

>
>
>
>> > AFAIK view schema is only used by DESCRIBE.
>>
>> Correction: Spark adds a new Project at the top of the parsed plan from
>> view, based on the stored schema, to make sure the view schema doesn't
>> change.
>>
>
> Thanks Wenchen! I thought I forgot something :) Yes it is the validation
> done in *checkAnalysis*:
>
>   // If the view output doesn't have the same number of columns
> neither with the child
>   // output, nor with the query column names, throw an
> AnalysisException.
>   // If the view's child output can't up cast to the view output,
>   // throw an AnalysisException, too.
>
> The view output comes from the schema:
>
>   val child = View(
> desc = metadata,
> output = metadata.schema.toAttributes,
> child = parser.parsePlan(viewText))
>
> So the schema is a nice-to-have for validation (here) or caching (in
> DESCRIBE), but not "required" or "should be frozen". Thanks Ryan and Burak
> for pointing that out in the SPIP. I will add a new paragraph accordingly.
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: SPIP: Catalog API for view metadata

2020-08-13 Thread Ryan Blue
om Hive Metastore and support
>>>> different storage backends, I am proposing a new view catalog API to load,
>>>> create, alter, and drop views.
>>>>
>>>> Document:
>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>
>>>> As part of a project to support common views across query engines like
>>>> Spark and Presto, my team used the view catalog API in our Spark
>>>> implementation. The project has been in production for over three months.
>>>>
>>>> Thanks,
>>>> John Zhuge
>>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> John Zhuge
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Decommissioning SPIP

2020-07-02 Thread Ryan Blue
+1

On Thu, Jul 2, 2020 at 8:00 AM Dongjoon Hyun 
wrote:

> +1.
>
> Thank you, Holden.
>
> Bests,
> Dongjoon.
>
> On Thu, Jul 2, 2020 at 6:43 AM wuyi  wrote:
>
>> +1 for having this feature in Spark
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Removing references to slave (and maybe in the future master)

2020-06-19 Thread Ryan Blue
Thanks for getting this started! I think it will be worth the effort, and
it's great to get started early in the 3.x release line to give the most
time to prepare for this.

On Thu, Jun 18, 2020 at 3:44 PM Russell Spitzer 
wrote:

> I really dislike the use of "worker" in the code base since it describes a
> process which doesn't actually do work, but I don't think it's in the scope
> for this ticket. I would definitely prefer we use "agent" instead of
> "worker" (or some other name) and have master switched to something like
> "resource manager" or something that actually describes the purpose of the
> process.
>
> I realize that touching "master" is going to disrupt just about everything
> but these name choices are usually the first thing that trips up new Spark
> Users. In my experience, I usually have to spend at least 15-20
> minutes explaining that a worker will not actually do work, and the master
> won't run their application.
>
> Thanks Holden for doing all the legwork on this!
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Ryan Blue
+1 for a 2.x release with a DSv2 API that matches 3.0.

There are a lot of big differences between the API in 2.4 and 3.0, and I
think a release to help migrate would be beneficial to organizations like
ours that will be supporting 2.x and 3.0 in parallel for quite a while.
Migration to Spark 3 is going to take time as people build confidence in
it. I don't think that can be avoided by leaving a larger feature gap
between 2.x and 3.0.

On Fri, Jun 12, 2020 at 5:53 PM Xiao Li  wrote:

> Based on my understanding, DSV2 is not stable yet. It still misses various
> features. Even our built-in file sources are still unable to fully migrate
> to DSV2. We plan to enhance it in the next few releases to close the gap.
>
> Also, the changes on DSV2 in Spark 3.0 did not break any existing
> application. We should encourage more users to try Spark 3 and increase the
> adoption of Spark 3.x.
>
> Xiao
>
> On Fri, Jun 12, 2020 at 5:36 PM Holden Karau  wrote:
>
>> So I one of the things which we’re planning on backporting internally is
>> DSv2, which I think being available in a community release in a 2 branch
>> would be more broadly useful. Anything else on top of that would be on a
>> case by case basis for if they make an easier upgrade path to 3.
>>
>> If we’re worried about people using 2.5 as a long term home we could
>> always mark it with “-transitional” or something similar?
>>
>> On Fri, Jun 12, 2020 at 4:33 PM Sean Owen  wrote:
>>
>>> What is the functionality that would go into a 2.5.0 release, that can't
>>> be in a 2.4.7 release? I think that's the key question. 2.4.x is the 2.x
>>> maintenance branch, and I personally could imagine being open to more
>>> freely backporting a few new features for 2.x users, whereas usually it's
>>> only bug fixes. Making 2.5.0 implies that 2.5.x is the 2.x maintenance
>>> branch but there's something too big for a 'normal' maintenance release,
>>> and I think the whole question turns on what that is.
>>>
>>> If it's things like JDK 11 support, I think that is unfortunately fairly
>>> 'breaking' because of dependency updates. But maybe that's not it.
>>>
>>>
>>> On Fri, Jun 12, 2020 at 4:38 PM Holden Karau 
>>> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> As we're getting closer to Spark 3 I'd like to revisit a Spark 2.5
>>>> release. Spark 3 brings a number of important changes, and by its nature is
>>>> not backward compatible. I think we'd all like to have as smooth an upgrade
>>>> experience to Spark 3 as possible, and I believe that having a Spark 2
>>>> release with some of the new functionality, while continuing to support
>>>> the older APIs and current Scala version, would make the upgrade path
>>>> smoother.
>>>>
>>>> This pattern is not uncommon in other Hadoop ecosystem projects, like
>>>> Hadoop itself and HBase.
>>>>
>>>> I know that Ryan Blue has indicated he is already going to be
>>>> maintaining something like that internally at Netflix, and we'll be doing
>>>> the same thing at Apple. It seems like having a transitional release could
>>>> benefit the community with easy migrations and help avoid duplicated work.
>>>>
>>>> I want to be clear I'm volunteering to do the work of managing a 2.5
>>>> release, so hopefully, this wouldn't create any substantial burdens on the
>>>> community.
>>>>
>>>> Cheers,
>>>>
>>>> Holden
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> <https://databricks.com/sparkaisummit/north-america>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Ryan Blue
+1 (non-binding)

On Tue, Jun 9, 2020 at 4:14 PM Tathagata Das 
wrote:

> +1 (binding)
>
> On Tue, Jun 9, 2020 at 5:27 PM Burak Yavuz  wrote:
>
>> +1
>>
>> Best,
>> Burak
>>
>> On Tue, Jun 9, 2020 at 1:48 PM Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> +1 (binding)
>>>
>>> Best Regards,
>>> Ryan
>>>
>>>
>>> On Tue, Jun 9, 2020 at 4:24 AM Wenchen Fan  wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Apache Spark 3.0 RC2

2020-05-23 Thread Ryan Blue
Hyukjin, you're right that I could have looked more closely. Sorry for
that. I definitely should have been more careful.

rb

On Fri, May 22, 2020 at 5:19 PM Hyukjin Kwon  wrote:

> Ryan,
>
> > I'm fine with the commit, other than the fact that it violated ASF norms
> <https://www.apache.org/foundation/voting.html> to commit without waiting
> for a review.
>
> It looks like it became a different proposal as you and other people
> discussed and suggested there, which you didn't technically vote on.
> It seems it was reviewed properly by other committers, and I see you were
> pinged multiple times.
> It might be best to read it carefully before posting it on the RC vote
> thread.
>
>
> 2020년 5월 23일 (토) 오전 6:55, 王斐 님이 작성:
>
>> Hi all,
>> Can we help review this PR and resolve this issue before the Spark 3.0 RC3?
>> This is a fault-tolerance bug in Spark, not as serious as a correctness
>> issue, but pretty high up. (I am just citing the comment,
>> https://github.com/apache/spark/pull/26339#issuecomment-632707720).
>> https://issues.apache.org/jira/browse/SPARK-29302
>> https://github.com/apache/spark/pull/26339
>>
>> Thanks a lot.
>>
>> Reynold Xin  于2020年5月19日周二 上午4:43写道:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.0.0.
>>>
>>> The vote is open until Thu May 21 11:59pm Pacific time and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.0-rc2 (commit
>>> 29853eca69bceefd227cbe8421a09c116b7b753a):
>>> https://github.com/apache/spark/tree/v3.0.0-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1345/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc2-docs/
>>>
>>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>
>>> This release is using the release script of the tag v3.0.0-rc2.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DatasourceV2] Default Mode for DataFrameWriter not Dependent on DataSource Version

2020-05-20 Thread Ryan Blue
The context on this is that it was confusing that the mode changed, which
introduced different behaviors for the same user code when moving from v1
to v2. Burak pointed this out and I agree that it's weird that if your
dependency changes from v1 to v2, your compiled Spark job starts appending
instead of erroring out when the table exists.

The work-around is to implement a new trait, SupportsCatalogOptions, that
allows you to extract a table identifier and catalog name from the options
in the DataFrameReader. That way, you can re-route to your catalog so that
Spark correctly uses a CreateTableAsSelect statement for ErrorIfExists
mode.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsCatalogOptions.java
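
For illustration, here's a minimal sketch of a source that implements the
trait (the class, catalog, and option names below are hypothetical, and the
remaining TableProvider methods are left abstract):

import org.apache.spark.sql.connector.catalog.{Identifier, SupportsCatalogOptions}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical source: routing reader/writer options to a catalog identifier
// lets Spark plan ErrorIfExists as CreateTableAsSelect against that catalog.
abstract class ExampleSource extends SupportsCatalogOptions {
  // map the "database" and "table" options to a catalog identifier
  override def extractIdentifier(options: CaseInsensitiveStringMap): Identifier =
    Identifier.of(Array(options.get("database")), options.get("table"))

  // pick the catalog by name, with a fallback default
  override def extractCatalog(options: CaseInsensitiveStringMap): String =
    options.getOrDefault("catalog", "example_catalog")

  // inferSchema/getTable from TableProvider are intentionally omitted here
}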

On Wed, May 20, 2020 at 2:50 PM Russell Spitzer 
wrote:

>
> While the ScalaDocs for DataFrameWriter say
>
> /**
>  * Specifies the behavior when data or table already exists. Options include:
>  * 
>  * `SaveMode.Overwrite`: overwrite the existing data.
>  * `SaveMode.Append`: append the data.
>  * `SaveMode.Ignore`: ignore the operation (i.e. no-op).
>  * `SaveMode.ErrorIfExists`: throw an exception at runtime.
>  * 
>  * 
>  * When writing to data source v1, the default option is `ErrorIfExists`. 
> When writing to data
>  * source v2, the default option is `Append`.
>  *
>  * @since 1.4.0
>  */
>
>
> As far as I can tell, using DataFrameWriter with a TableProvider-based
> DataSource V2 will still default to ErrorIfExists, which breaks existing
> code since DSv2 cannot support ErrorIfExists mode. I noticed that in the history
> of DataFrameWriter there were versions which differentiated between DSv2
> and DSv1 and set the mode accordingly, but this seems to no longer be the
> case. Was this intentional? I feel like if we could
> have the default be based on the Source then upgrading code from DSV1 ->
> DSV2 would be much easier for users.
>
> I'm currently testing this on RC2
>
>
> Any thoughts?
>
> Thanks for your time as usual,
> Russ
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Ryan Blue
Okay, I took a look at the PR and I think it should be okay. The new
classes are unfortunately public, but are in catalyst which is considered
private. So this is the approach we discussed.

I'm fine with the commit, other than the fact that it violated ASF norms
<https://www.apache.org/foundation/voting.html> to commit without waiting
for a review.

On Wed, May 20, 2020 at 10:00 AM Ryan Blue  wrote:

> Why was https://github.com/apache/spark/pull/28523 merged with a -1? We
> discussed this months ago and concluded that it was a bad idea to introduce
> a new v2 API that cannot have reliable behavior across sources.
>
> The last time I checked that PR, the approach I discussed with Tathagata
> was to not add update mode to DSv2. Instead, Tathagata gave a couple of
> reasonable options to avoid it. Why were those not done?
>
> This is the second time this year that a PR with a -1 was merged. Does the
> Spark community not follow the convention to build consensus before merging
> changes?
>
> On Wed, May 20, 2020 at 12:13 AM Wenchen Fan  wrote:
>
>> Seems the priority of SPARK-31706 is incorrectly marked, and it's a
>> blocker now. The fix was merged just a few hours ago.
>>
>> This should be a -1 for RC2.
>>
>> On Wed, May 20, 2020 at 2:42 PM rickestcode 
>> wrote:
>>
>>> +1
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -----
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Ryan Blue
Why was https://github.com/apache/spark/pull/28523 merged with a -1? We
discussed this months ago and concluded that it was a bad idea to introduce
a new v2 API that cannot have reliable behavior across sources.

The last time I checked that PR, the approach I discussed with Tathagata
was to not add update mode to DSv2. Instead, Tathagata gave a couple of
reasonable options to avoid it. Why were those not done?

This is the second time this year that a PR with a -1 was merged. Does the
Spark community not follow the convention to build consensus before merging
changes?

On Wed, May 20, 2020 at 12:13 AM Wenchen Fan  wrote:

> Seems the priority of SPARK-31706 is incorrectly marked, and it's a
> blocker now. The fix was merged just a few hours ago.
>
> This should be a -1 for RC2.
>
> On Wed, May 20, 2020 at 2:42 PM rickestcode 
> wrote:
>
>> +1
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

2020-05-14 Thread Ryan Blue
ACCEPT_ANY_SCHEMA isn't a good way to solve the problem because you often
want at least some checking in Spark to validate the rows match. It's a
good way to be unblocked, but not a long-term solution.
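
For reference, this is roughly how a table opts in today (it unblocks you, per
the above, but pushes all write validation to the source). A sketch only, with
a placeholder table name and the other Table methods left abstract:

import java.util
import org.apache.spark.sql.connector.catalog.{SupportsWrite, TableCapability}

// Advertising ACCEPT_ANY_SCHEMA tells Spark to skip its strict output-column
// check, leaving write-schema validation entirely to the source itself.
abstract class ExampleTable extends SupportsWrite {
  override def name(): String = "example_table"

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA)

  // schema() and newWriteBuilder() are intentionally omitted in this sketch
}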

On Thu, May 14, 2020 at 4:57 AM Russell Spitzer 
wrote:

> Yeah! That is working for me. Thanks!
>
> On Thu, May 14, 2020 at 12:10 AM Wenchen Fan  wrote:
>
>> I think we already have this table capacity: ACCEPT_ANY_SCHEMA. Can you
>> try that?
>>
>> On Thu, May 14, 2020 at 6:17 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I would really appreciate that, I'm probably going to just write a
>>> planner rule for now which matches up my table schema with the query output
>>> if they are valid, and fails analysis otherwise. This approach is how I got
>>> metadata columns in so I believe it would work for writing as well.
>>>
>>> On Wed, May 13, 2020 at 5:13 PM Ryan Blue  wrote:
>>>
>>>> I agree with adding a table capability for this. This is something that
>>>> we support in our Spark branch so that users can evolve tables without
>>>> breaking existing ETL jobs -- when you add an optional column, it shouldn't
>>>> fail the existing pipeline writing data to a table. I can contribute the
>>>> changes to validation if people are interested.
>>>>
>>>> On Wed, May 13, 2020 at 2:57 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> In DSv1 this was pretty easy to do because the burden of
>>>>> verification for writes had to be in the datasource; the new setup makes
>>>>> partial writes difficult.
>>>>>
>>>>> resolveOutputColumns checks the table schema against the write plan's
>>>>> output and will fail any requests which don't contain every column as
>>>>> specified in the table schema.
>>>>> I would like it if, instead, we either made this check optional for a
>>>>> datasource, perhaps via an "allow partial writes" trait for the table, or just
>>>>> allowed analysis
>>>>> to fail in "withInputDataSchema", where an implementer could throw
>>>>> exceptions on underspecified writes.
>>>>>
>>>>>
>>>>> The use case here is that C* (and many other sinks) have mandated
>>>>> columns that must be present during an insert as well as those
>>>>> which are not required.
>>>>>
>>>>> Please let me know if i've misread this,
>>>>>
>>>>> Thanks for your time again,
>>>>> Russ
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

2020-05-13 Thread Ryan Blue
I agree with adding a table capability for this. This is something that we
support in our Spark branch so that users can evolve tables without
breaking existing ETL jobs -- when you add an optional column, it shouldn't
fail the existing pipeline writing data to a table. I can contribute the
changes to validation if people are interested.

On Wed, May 13, 2020 at 2:57 PM Russell Spitzer 
wrote:

> In DSv1 this was pretty easy to do because the burden of verification
> for writes had to be in the datasource; the new setup makes partial writes
> difficult.
>
> resolveOutputColumns checks the table schema against the write plan's
> output and will fail any requests which don't contain every column as
> specified in the table schema.
> I would like it if, instead, we either made this check optional for a
> datasource, perhaps via an "allow partial writes" trait for the table, or just
> allowed analysis
> to fail in "withInputDataSchema", where an implementer could throw
> exceptions on underspecified writes.
>
>
> The use case here is that C* (and many other sinks) have mandated columns
> that must be present during an insert as well as those
> which are not required.
>
> Please let me know if i've misread this,
>
> Thanks for your time again,
> Russ
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Ryan Blue
+1 for the approach Jungtaek suggests. That will avoid needing to support
behavior that is not well understood with minimal changes.

On Tue, May 12, 2020 at 1:45 AM Jungtaek Lim 
wrote:

> Before I forget, we'd better not forget to change the doc, as create table
> doc looks to represent current syntax which will be incorrect later.
>
> On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim 
> wrote:
>
>> It's not only for end users, but also for us. Spark itself uses the
>> config "true" and "false" in tests and it still brings confusion. We still
>> have to deal with both situations.
>>
>> I'm wondering how many days it would take to revert it cleanly, but
>> if we worry about the amount of code change just around the new RC, what
>> about making the code dirty (to be fixed soon) but with less headache, via
>> the traditional (and bad) way?
>>
>> Let's just remove the config so that it cannot be used in any way
>> (even in the Spark codebase), and set the corresponding field in the parser to a
>> constant value so that no one can modify it in any way. This would make it
>> intentionally dead code which should be cleaned up later, so let's add a
>> FIXME comment there so that anyone can take it up for cleaning up the code
>> later. (If no one volunteers then I'll probably pick it up.)
>>
>> That is a bad pattern, but still better, as we prevent end users (even
>> early adopters) from going through the undocumented path in any way, and that will
>> be explicitly marked as "should be fixed". This is different from retaining the
>> config - I don't expect the unified create table syntax to land in a
>> bugfix version, so even if the unified create table syntax lands in 3.1.0
>> (which is also not guaranteed), the config will live on in 3.0.x either way. If
>> we temporarily go the dirty way then we can clean up the code in any version,
>> even in a bugfix version, maybe within a couple of weeks just after 3.0.0
>> is released.
>>
>> Does it sound valid?
>>
>> On Tue, May 12, 2020 at 2:35 PM Wenchen Fan  wrote:
>>
>>> SPARK-30098 was merged about 6 months ago. It's not a clean revert and
>>> we may need to spend quite a bit of time to resolve conflicts and fix tests.
>>>
>>> I don't see why it's still a problem if a feature is disabled and hidden
>>> from end-users (it's undocumented, the config is internal). The related
>>> code will be replaced in the master branch sooner or later, when we unify
>>> the syntaxes.
>>>
>>>
>>>
>>> On Tue, May 12, 2020 at 6:16 AM Ryan Blue 
>>> wrote:
>>>
>>>> I'm all for getting the unified syntax into master. The only issue
>>>> appears to be whether or not to pass the presence of the EXTERNAL keyword
>>>> through to a catalog in v2. Maybe it's time to start a discuss thread for
>>>> that issue so we're not stuck for another 6 weeks on it.
>>>>
>>>> On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Btw, another thing I wonder is: is it good to retain the flag on master
>>>>> as an intermediate step? Wouldn't it be better for us to start the "unified
>>>>> create table syntax" from scratch?
>>>>>
>>>>>
>>>>> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> I'm sorry, but I have to agree with Ryan and Russell. I chose
>>>>>> option 1 because it's less bad than option 2, but it doesn't mean I
>>>>>> fully agree with option 1.
>>>>>>
>>>>>> Let's make the things below clear if we really go with option 1,
>>>>>> otherwise please consider reverting it.
>>>>>>
>>>>>> * Have you fully identified "all" the paths where the second create
>>>>>> table syntax is taken?
>>>>>> * Could you explain "why" to end users without any confusion? Do you
>>>>>> think end users will understand it easily?
>>>>>> * Do you have actual end users to guide to turn this on? Or do you
>>>>>> have a plan to turn this on for your team/customers and deal with
>>>>>> the ambiguity?
>>>>>> * Could you please document how things will change if the flag
>>>>>> is turned on?
>>>>>>
>>>>>> 

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Ryan Blue
I'm all for getting the unified syntax into master. The only issue appears
to be whether or not to pass the presence of the EXTERNAL keyword through
to a catalog in v2. Maybe it's time to start a discuss thread for that
issue so we're not stuck for another 6 weeks on it.

On Mon, May 11, 2020 at 3:13 PM Jungtaek Lim 
wrote:

> Btw, another thing I wonder is: is it good to retain the flag on master as
> an intermediate step? Wouldn't it be better for us to start the "unified create
> table syntax" from scratch?
>
>
> On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim 
> wrote:
>
>> I'm sorry, but I have to agree with Ryan and Russell. I chose option
>> 1 because it's less bad than option 2, but it doesn't mean I fully agree
>> with option 1.
>>
>> Let's make the things below clear if we really go with option 1, otherwise
>> please consider reverting it.
>>
>> * Have you fully identified "all" the paths where the second create
>> table syntax is taken?
>> * Could you explain "why" to end users without any confusion? Do you
>> think end users will understand it easily?
>> * Do you have actual end users to guide to turn this on? Or do you
>> have a plan to turn this on for your team/customers and deal with
>> the ambiguity?
>> * Could you please document how things will change if the flag is
>> turned on?
>>
>> I guess option 1 is to leave the flag as an "undocumented" one and forget
>> about the path to turn it on, but I think that would turn the
>> feature into a "broken window" that we are not able to touch.
>>
>> On Tue, May 12, 2020 at 6:45 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I think reverting 30098 is the right decision here if we want to unblock
>>> 3.0. We shouldn't ship with features which we know do not function in the
>>> way we intend, regardless of how little exposure most users have to them.
>>> Even if it's off by default, we should probably work to avoid switches that
>>> cause things to behave unpredictably or require a flow chart to actually
>>> determine what will happen.
>>>
>>> On Mon, May 11, 2020 at 3:07 PM Ryan Blue 
>>> wrote:
>>>
>>>> I'm all for fixing behavior in master by turning this off as an
>>>> intermediate step, but I don't think that Spark 3.0 can safely include
>>>> SPARK-30098.
>>>>
>>>> The problem is that SPARK-30098 introduces strange behavior, as
>>>> Jungtaek pointed out. And that behavior is not fully understood. While
>>>> working on a unified CREATE TABLE syntax, I hit additional test
>>>> failures
>>>> <https://github.com/apache/spark/pull/28026#issuecomment-606967363>
>>>> where the wrong create path was being used.
>>>>
>>>> Unless we plan to NOT support the behavior
>>>> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
>>>> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
>>>> with this problem for years to come.
>>>>
>>>> On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:
>>>>
>>>>> +1. Agree with Xiao Li and Jungtaek Lim.
>>>>>
>>>>> This seems to be controversial, and can not be done in a short time.
>>>>> It is
>>>>> necessary to choose option 1 to unblock Spark 3.0 and support it in
>>>>> 3.1.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Ryan Blue
I'm all for fixing behavior in master by turning this off as an
intermediate step, but I don't think that Spark 3.0 can safely include
SPARK-30098.

The problem is that SPARK-30098 introduces strange behavior, as Jungtaek
pointed out. And that behavior is not fully understood. While working on a
unified CREATE TABLE syntax, I hit additional test failures
<https://github.com/apache/spark/pull/28026#issuecomment-606967363> where
the wrong create path was being used.

Unless we plan to NOT support the behavior
when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we
should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal
with this problem for years to come.
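
To make the ambiguity concrete, here is the kind of statement that silently
changes meaning depending on the flag (a sketch for spark-shell, where `spark`
is predefined; table names are arbitrary):

// With no USING / ROW FORMAT / STORED AS clause, the provider behind t1 depends
// on spark.sql.legacy.createHiveTableByDefault.enabled (and on the rule order).
spark.sql("CREATE TABLE t1 (id INT)")                    // Hive table in 2.4; default-source table under SPARK-30098
spark.sql("CREATE TABLE t2 (id INT) STORED AS parquet")  // always the Hive CREATE TABLE rule
spark.sql("CREATE TABLE t3 (id INT) USING parquet")      // always the native CREATE TABLE rule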

On Mon, May 11, 2020 at 1:06 AM JackyLee  wrote:

> +1. Agree with Xiao Li and Jungtaek Lim.
>
> This seems to be controversial, and can not be done in a short time. It is
> necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Ryan Blue
nt and prevailing in the
>>>> existing codebase, for example, see StateOperatorProgress and
>>>> StreamingQueryProgress in Structured Streaming.
>>>> However, I realised that we also have other approaches in the current
>>>> codebase. There appear to be
>>>> four approaches to dealing with Java specifics in general:
>>>>
>>>>1. Java specific classes such as JavaRDD and JavaSparkContext.
>>>>2. Java specific methods with the same name that overload its
>>>>parameters, see functions.scala.
>>>>3. Java specific methods with a different name that needs to return
>>>>a different type such as TaskContext.resourcesJMap vs
>>>>TaskContext.resources.
>>>>4. One method that returns a Java instance for both Scala and Java
>>>>sides. see StateOperatorProgress and StreamingQueryProgress.
>>>>
>>>>
>>>> *Analysis on the current codebase:*
>>>>
>>>> I agree with 2. approach because the corresponding cases give you a
>>>> consistent API usage across
>>>> other language APIs in general. Approach 1. is from the old world when
>>>> we didn't have unified APIs.
>>>> This might be the worst approach.
>>>>
>>>> 3. and 4. are controversial.
>>>>
>>>> For 3., if you have to use Java APIs, then, you should search if there
>>>> is a variant of that API
>>>> every time specifically for Java APIs. But yes, it gives you Java/Scala
>>>> friendly instances.
>>>>
>>>> For 4., having one API that returns a Java instance makes you able to
>>>> use it in both Scala and Java APIs
>>>> sides although it makes you call asScala in Scala side specifically.
>>>> But you don’t
>>>> have to search if there’s a variant of this API and it gives you a
>>>> consistent API usage across languages.
>>>>
>>>> Also, note that calling Java in Scala is legitimate but the opposite
>>>> case is not, to the best of my knowledge.
>>>> In addition, you should have a method that returns a Java instance for
>>>> PySpark or SparkR to support.
>>>>
>>>>
>>>> *Proposal:*
>>>>
>>>> I would like to have a general guidance on this that the Spark dev
>>>> agrees upon: Do 4. approach. If not possible, do 3. Avoid 1 almost at all
>>>> cost.
>>>>
>>>> Note that this isn't a hard requirement but *a general guidance*;
>>>> therefore, the decision might be up to
>>>> the specific context. For example, when there are some strong arguments
>>>> to have a separate Java specific API, that’s fine.
>>>> Of course, we won’t change the existing methods given Micheal’s rubric
>>>> added before. I am talking about new
>>>> methods in unreleased branches.
>>>>
>>>> Any concern or opinion on this?
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: DSv2 & DataSourceRegister

2020-04-07 Thread Ryan Blue
Hi Andrew,

With DataSourceV2, I recommend plugging in a catalog instead of using
DataSource. As you've noticed, the way that you plug in data sources isn't
very flexible. That's one of the reasons why we changed the plugin system
and made it possible to use named catalogs that load implementations based
on configuration properties.

I think it's fine to consider how to patch the registration trait, but I
really don't recommend continuing to identify table implementations
directly by name.
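
As a sketch of what the catalog route looks like from the user side (the
catalog name, implementation class, and options here are hypothetical):

import org.apache.spark.sql.SparkSession

// A named catalog is wired up purely through configuration, so the same user
// code works no matter which artifact backs "lake" on a given cluster.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.lake", "com.example.LakeCatalog")          // hypothetical TableCatalog impl
  .config("spark.sql.catalog.lake.warehouse", "s3://bucket/warehouse")  // options passed to that catalog
  .getOrCreate()

spark.sql("SELECT * FROM lake.db.events").show()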

On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo  wrote:

> Hi all,
>
> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
> send an email to the dev list for discussion.
>
> As the DSv2 API evolves, some breaking changes are occasionally made
> to the API. It's possible to split a plugin into a "common" part and
> multiple version-specific parts and this works OK to have a single
> artifact for users, as long as they write out the fully qualified
> classname as the DataFrame format(). The one part that can't be
> currently worked around is the DataSourceRegister trait. Since classes
> which implement DataSourceRegister must also implement DataSourceV2
> (and its mixins), changes to those interfaces cause the ServiceLoader
> to fail when it attempts to load the "wrong" DataSourceV2 class.
> (there's also an additional prohibition against multiple
> implementations having the same ShortName in
> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource).
> This means users will need to update their notebooks/code/tutorials if
> they run @ a different site whose cluster is a different version.
>
> To solve this, I proposed in SPARK-31363 a new trait who would
> function the same as the existing DataSourceRegister trait, but adds
> an additional method:
>
> public Class getImplementation();
>
> ...which will allow DSv2 plugins to dynamically choose the appropriate
> DataSourceV2 class based on the runtime environment. This would let us
> release a single artifact for different Spark versions and users could
> use the same artifactID & format regardless of where they were
> executing their code. If there was no services registered with this
> new trait, the functionality would remain the same as before.
>
> I think this functionality will be useful as DSv2 continues to evolve,
> please let me know your thoughts.
>
> Thanks
> Andrew
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-01 Thread Ryan Blue
-1 (non-binding)

I agree with Jungtaek. The change to create datasource tables instead of
Hive tables by default (no USING or STORED AS clauses) has created
confusing behavior and should either be rolled back or fixed before 3.0.

On Wed, Apr 1, 2020 at 5:12 AM Sean Owen  wrote:

> Those are not per se release blockers. They are (perhaps important)
> improvements to functionality. I don't know who is active and able to
> review that part of the code; I'd look for authors of changes in the
> surrounding code. The question here isn't so much what one would like
> to see in this release, but evaluating whether the release is sound
> and free of show-stopper problems. There will always be potentially
> important changes and fixes to come.
>
> On Wed, Apr 1, 2020 at 5:31 AM Dr. Kent Yao  wrote:
> >
> > -1
> > Do not release this package because v3.0.0 is the 3rd major release
> since we
> > added Spark On Kubernetes. Can we make it more production-ready as it has
> > been experimental for more than 2 years?
> >
> > The main practical adoption of Spark on Kubernetes is to take on the
> role of
> > other cluster managers (mainly YARN). And the storage layer (mainly HDFS)
> > would be more likely kept anyway. But Spark on Kubernetes with HDFS seems
> > not to work properly.
> >
> > e.g.
> > This ticket and PR were submitted 7 months ago, and never get reviewed.
> > https://issues.apache.org/jira/browse/SPARK-29974
> > https://issues.apache.org/jira/browse/SPARK-28992
> > https://github.com/apache/spark/pull/25695
> >
> > And this.
> > https://issues.apache.org/jira/browse/SPARK-28896
> > https://github.com/apache/spark/pull/25609
> >
> > In terms of how often this module is updated, it seems to be stable.
> > But in terms of how often PRs for this module are reviewed, it seems
> that it
> > will stay experimental for a long time.
> >
> > Thanks.
> >
> >
> >
> > --
> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >
> > -----
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Ryan Blue
Feel free to open another issue; I just used that one since it describes
this and doesn't appear to be done.

On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim 
wrote:

> UPDATE: Sorry I just missed the PR (
> https://github.com/apache/spark/pull/28026). I still think it'd be nice
> to avoid recycling the JIRA issue which was resolved before. Shall we have
> a new JIRA issue with linking to SPARK-30098, and set proper priority?
>
> On Thu, Mar 26, 2020 at 7:59 AM Jungtaek Lim 
> wrote:
>
>> Would it be better to prioritize this to make sure the change is included
>> in Spark 3.0? (Maybe file an issue and set it as a blocker.)
>>
>> Looks like there's consensus that SPARK-30098 brought an ambiguity issue
>> which should be fixed (though the assessment of severity seems to
>> differ), and once we notice the issue it would be really odd if we
>> publish it as is and try to fix it later (the fix may not even be
>> included in 3.0.x as it might bring a behavioral change).
>>
>> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan  wrote:
>>
>>> Hi Ryan,
>>>
>>> It's great to hear that you are cleaning up this long-standing mess.
>>> Please let me know if you hit any problems that I can help with.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan 
>>>> wrote:
>>>>
>>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>>> BY transformList.
>>>>>
>>>>
>>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>>> TABLE syntax, we can also update Catalog.createTable() to support
>>>> creating partitioned tables
>>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Ryan Blue
Here's a WIP PR with the basic changes:
https://github.com/apache/spark/pull/28026

I still need to update tests in that branch and add the conversions to the
old Hive plans. But at least you can see how the parser part works and how
I'm converting the extra clauses for DSv2. This also enables us to support
Hive create syntax in DSv2.

On Wed, Mar 25, 2020 at 3:59 PM Jungtaek Lim 
wrote:

> Would it be better to prioritize this to make sure the change is included
> in Spark 3.0? (Maybe file an issue and set it as a blocker.)
>
> Looks like there's consensus that SPARK-30098 brought an ambiguity issue
> which should be fixed (though the assessment of severity seems to
> differ), and once we notice the issue it would be really odd if we
> publish it as is and try to fix it later (the fix may not even be
> included in 3.0.x as it might bring a behavioral change).
>
> On Tue, Mar 24, 2020 at 3:37 PM Wenchen Fan  wrote:
>
>> Hi Ryan,
>>
>> It's great to hear that you are cleaning up this long-standing mess.
>> Please let me know if you hit any problems that I can help with.
>>
>> Thanks,
>> Wenchen
>>
>> On Sat, Mar 21, 2020 at 3:16 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> On Thu, Mar 19, 2020 at 3:46 AM Wenchen Fan  wrote:
>>>
>>>> 2. PARTITIONED BY colTypeList: I think we can support it in the
>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>> BY transformList.
>>>>
>>>
>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>> TABLE syntax, we can also update Catalog.createTable() to support
>>> creating partitioned tables
>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Supporting hive on DataSourceV2

2020-03-23 Thread Ryan Blue
Hi Jacky,

We’ve internally released support for Hive tables (and Spark FileFormat
tables) using DataSourceV2 so that we can switch between catalogs; sounds
like that’s what you are planning to build as well. It would be great to
work with the broader community on a Hive connector.

I will get a branch of our connectors published so that you can take a
look. I think it should be fairly close to what you’re talking about
building, with a few exceptions:

   - Our implementation always uses our S3 committers, but it should be
   easy to change this
   - It supports per-partition formats, like Hive

Do you have an idea about where the connector should be developed? I don’t
think it makes sense for it to be part of Spark. That would keep complexity
in the main project and require updating Hive versions slowly. Using a
separate project would mean less code in Spark specific to one source, and
could more easily support multiple Hive versions. Maybe we should create a
project for catalog plug-ins?

rb

On Mon, Mar 23, 2020 at 4:20 AM JackyLee  wrote:

> Hi devs,
> I’d like to start a discussion about Supporting Hive on DatasourceV2. We’re
> now working on a project using DataSourceV2 to provide multiple source
> support and it works with the data lake solution very well, yet it does not
> yet support HiveTable.
>
> There are 3 reasons why we need to support Hive on DataSourceV2.
> 1. Hive itself is one of Spark data sources.
> 2. HiveTable is essentially a FileTable with its own input and output
> formats, it works fine with FileTable.
> 3. HiveTable should be stateless, and users can freely read or write Hive
> using batch or microbatch.
>
> We implemented stateless Hive on DataSourceV1; it supports users writing
> into Hive via streaming or batch and it has been widely used in our company.
> Recently, we are trying to support Hive on DataSourceV2, Multiple Hive
> Catalog and DDL Commands have already been supported.
>
> Looking forward to more discussions on this.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Ryan Blue
I have an update to the parser that unifies the CREATE TABLE rules. It took
surprisingly little work to get the parser updated to produce
CreateTableStatement and CreateTableAsSelectStatement with the Hive info.
And the only fields I need to add to those statements were serde: SerdeInfo
and external: Boolean.

>From there, we can use the conversion rules to re-create the same Hive
command for v1 or pass the data as properties for v2. I’ll work on getting
this cleaned up and open a PR hopefully tomorrow.

For the questions about how this gets converted to either a Spark or Hive
create table command, that is really up to analyzer rules and
configuration. With my changes, it is no longer determined by the parser:
the parser just produces a node that includes all of the user options and
Spark decides what to do with that in the analyzer. Also, there's already
an option to convert Hive syntax to a Spark
command, spark.sql.hive.convertCTAS.

rb

On Thu, Mar 19, 2020 at 12:46 AM Wenchen Fan  wrote:

> Big +1 to have one single unified CREATE TABLE syntax.
>
> In general, we can say there are 2 ways to specify the table provider:
> USING clause and ROW FORMAT/STORED AS clause. These 2 ways are mutually
> exclusive. If none is specified, it implicitly indicates USING
> defaultSource.
>
> I'm fine with a few special cases which can indicate the table provider,
> like EXTERNAL indicates Hive Serde table. A few thoughts:
> 1. SKEWED BY ...: We support it in Hive syntax just to fail it with a
> nice error message. We can support it in the unified syntax as well, and
> fail it.
> 2. PARTITIONED BY colTypeList: I think we can support it in the unified
> syntax. Just make sure it doesn't appear together with PARTITIONED BY
> transformList.
> 3. OPTIONS: We can either map it to Hive Serde properties, or let it
> indicate non-Hive tables.
>
> On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim 
> wrote:
>
>> Thanks Nicholas for the side comment; you'll need to interpret "CREATE
>> TABLE USING HIVE FORMAT" as CREATE TABLE using "HIVE FORMAT", but yes it
>> may add the confusion.
>>
>> Ryan, thanks for the detailed analysis and proposal. That's what I would
>> like to see in discussion thread.
>>
>> I'm open to solutions which enable end users to specify their intention
>> properly - my main concern of SPARK-30098 is that it becomes unclear which
>> provider the query will use in create table unless USING provider is
>> explicitly specified. If the new proposal makes clear on this, that should
>> be better than now.
>>
>> Replying inline:
>>
>> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Side comment: The current docs for CREATE TABLE
>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
>>> actually part of the syntax
>>> <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>
>>> .
>>>
>>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue 
>>> wrote:
>>>
>>>> Jungtaek, it sounds like you consider the two rules to be separate
>>>> syntaxes with their own consistency rules. For example, if I am using the
>>>> Hive syntax rule, then the PARTITIONED BY clause adds new (partition)
>>>> columns and requires types for those columns; if I’m using the Spark syntax
>>>> rule with USING then PARTITIONED BY must reference existing columns
>>>> and cannot include types.
>>>>
>>>> I agree that this is confusing to users! We should fix it, but I don’t
>>>> think the right solution is to continue to have two rules with divergent
>>>> syntax.
>>>>
>>>> This is confusing to users because they don’t know anything about
>>>> separate parser rules. All the user sees is that sometimes PARTITION BY
>>>> requires types and sometimes it doesn’t. Yes, we could add a keyword,
>>>> HIVE, to signal that the syntax is borrowed from Hive for that case,
>>>> but that actually breaks queries that run in Hive.
>>>>
>>> That might less matter, because SPARK-30098 (and I guess your proposal
>> as well) enforces end users to add "USING HIVE" for their queries to enable
>> Hive provider in any way, even only when the query matches with rule 1

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Ryan Blue
al" one and try 
>>>>>> to
>>>>>> isolate each other? What's the point of providing a legacy config to go
>>>>>> back to the old one even we fear about breaking something to make it 
>>>>>> better
>>>>>> or clearer? We do think that table provider is important (hence the 
>>>>>> change
>>>>>> was done), then is it still a trivial problem whether the provider is
>>>>>> affected by specifying the "optional" fields?
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan 
>>>>>> wrote:
>>>>>>
>>>>>>> I think the general guideline is to promote Spark's own CREATE TABLE
>>>>>>> syntax instead of the Hive one. Previously these two rules are mutually
>>>>>>> exclusive because the native syntax requires the USING clause while the
>>>>>>> Hive syntax makes ROW FORMAT or STORED AS clause optional.
>>>>>>>
>>>>>>> It's a good move to make the USING clause optional, which makes it
>>>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads 
>>>>>>> to
>>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a 
>>>>>>> serious
>>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>>>> parser rules conflict and we pick the native syntax depending on the 
>>>>>>> rule
>>>>>>> order. But the user-facing behavior looks fine.
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in
>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE 
>>>>>>> syntax?
>>>>>>> Then CREATE EXTERNAL TABLE creates Hive table like 2.4.
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi devs,
>>>>>>>>
>>>>>>>> I'd like to initiate discussion and hear the voices for resolving
>>>>>>>> ambiguous parser rule between two "create table"s being brought by
>>>>>>>> SPARK-30098 [1].
>>>>>>>>
>>>>>>>> Previously, "create table" parser rules were clearly distinguished
>>>>>>>> via "USING provider", which was very intuitive and deterministic. Say, 
>>>>>>>> DDL
>>>>>>>> query creates "Hive" table unless "USING provider" is specified,
>>>>>>>> (Please refer the parser rule in branch-2.4 [2])
>>>>>>>>
>>>>>>>> After SPARK-30098, "create table" parser rules became ambiguous
>>>>>>>> (please refer the parser rule in branch-3.0 [3]) - the factors
>>>>>>>> differentiating two rules are only "ROW FORMAT" and "STORED AS" which 
>>>>>>>> are
>>>>>>>> all defined as "optional". Now it relies on the "order" of parser rule
>>>>>>>> which end users would have no idea to reason about, and very 
>>>>>>>> unintuitive.
>>>>>>>>
>>>>>>>> Furthermore, undocumented rule of EXTERNAL (added in the first rule
>>>>>>>> to provide better message) brought more confusion (I've described the
>>>>>>>> broken existing query via SPARK-30436 [4]).
>>>>>>>>
>>>>>>>> Personally I'd like to see two rules mutually exclusive, instead of
>>>>>>>> trying to document the difference and talk end users to be careful 
>>>>>>>> about
>>>>>>>> their query. I'm seeing two ways to make rules be mutually exclusive:
>>>>>>>>
>>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE ...
>>>>>>>> "HIVE" TABLE ...`.
>>>>>>>>
>>>>>>>> pros. This is the simplest way to distinguish between two rules.
>>>>>>>> cons. This would lead end users to change their query if they
>>>>>>>> intend to create Hive table. (Given we will also provide legacy option 
>>>>>>>> I'm
>>>>>>>> feeling this is acceptable.)
>>>>>>>>
>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory one.
>>>>>>>>
>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>> cons. Less intuitive, because they have been optional and now
>>>>>>>> become mandatory to fall into the second rule.
>>>>>>>>
>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>> syntax
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>> 2.
>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 3.
>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>
>>>>>>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Metrics Support for DS V2

2020-01-20 Thread Ryan Blue
I sent them to you. I had to go direct because the ASF mailing list will
remove attachments. I'm happy to send them to others if needed as well.

On Sun, Jan 19, 2020 at 9:01 PM Sandeep Katta <
sandeep0102.opensou...@gmail.com> wrote:

> Please send me the patch , I will apply and test.
>
> On Fri, 17 Jan 2020 at 10:33 PM, Ryan Blue  wrote:
>
>> We've implemented these metrics in the RDD (for input metrics) and in the
>> v2 DataWritingSparkTask. That approach gives you the same metrics in the
>> stage views that you get with v1 sources, regardless of the v2
>> implementation.
>>
>> I'm not sure why they weren't included from the start. It looks like the
>> way metrics are collected is changing. There are a couple of metrics for
>> number of rows; looks like one that goes to the Spark SQL tab and one that
>> is used for the stages view.
>>
>> If you'd like, I can send you a patch.
>>
>> rb
>>
>> On Fri, Jan 17, 2020 at 5:09 AM Wenchen Fan  wrote:
>>
>>> I think there are a few details we need to discuss.
>>>
>>> how frequently a source should update its metrics? For example, if file
>>> source needs to report size metrics per row, it'll be super slow.
>>>
>>> what metrics a source should report? data size? numFiles? read time?
>>>
>>> shall we show metrics in SQL web UI as well?
>>>
>>> On Fri, Jan 17, 2020 at 3:07 PM Sandeep Katta <
>>> sandeep0102.opensou...@gmail.com> wrote:
>>>
>>>> Hi Devs,
>>>>
>>>> Currently DS V2 does not update any input metrics. SPARK-30362 aims at
>>>> solving this problem.
>>>>
>>>> We can have the below approach. Have marker interface let's say
>>>> "ReportMetrics"
>>>>
>>>> If the DataSource Implements this interface, then it will be easy to
>>>> collect the metrics.
>>>>
>>>> For e.g. FilePartitionReaderFactory can support metrics.
>>>>
>>>> So it will be easy to collect the metrics if FilePartitionReaderFactory
>>>> implements ReportMetrics
>>>>
>>>>
>>>> Please let me know the views, or even if we want to have new solution
>>>> or design.
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Ryan Blue
We've implemented these metrics in the RDD (for input metrics) and in the
v2 DataWritingSparkTask. That approach gives you the same metrics in the
stage views that you get with v1 sources, regardless of the v2
implementation.

I'm not sure why they weren't included from the start. It looks like the
way metrics are collected is changing. There are a couple of metrics for
number of rows; looks like one that goes to the Spark SQL tab and one that
is used for the stages view.

If you'd like, I can send you a patch.

rb

On Fri, Jan 17, 2020 at 5:09 AM Wenchen Fan  wrote:

> I think there are a few details we need to discuss.
>
> how frequently a source should update its metrics? For example, if file
> source needs to report size metrics per row, it'll be super slow.
>
> what metrics a source should report? data size? numFiles? read time?
>
> shall we show metrics in SQL web UI as well?
>
> On Fri, Jan 17, 2020 at 3:07 PM Sandeep Katta <
> sandeep0102.opensou...@gmail.com> wrote:
>
>> Hi Devs,
>>
>> Currently DS V2 does not update any input metrics. SPARK-30362 aims at
>> solving this problem.
>>
>> We can have the below approach. Have marker interface let's say
>> "ReportMetrics"
>>
>> If the DataSource Implements this interface, then it will be easy to
>> collect the metrics.
>>
>> For e.g. FilePartitionReaderFactory can support metrics.
>>
>> So it will be easy to collect the metrics if FilePartitionReaderFactory
>> implements ReportMetrics
>>
>>
>> Please let me know the views, or even if we want to have new solution or
>> design.
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Ryan Blue
g as Transform
>> name, so it's impossible to have well-defined semantic, and also different
>> sources may have different semantic for the same Transform name.
>>
>> I'd suggest we forbid arbitrary string as Transform (the ApplyTransform
>> class). We can even follow DS  V1 Filter and expose the classes directly.
>>
>> On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> I would like to suggest to take one step back at
>>> https://github.com/apache/spark/pull/24117 and rethink about it.
>>> I am writing this email as I raised the issue few times but could not
>>> have enough responses promptly, and
>>> the code freeze is being close.
>>>
>>> In particular, please refer the below comments for the full context:
>>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>>> - https://github.com/apache/spark/pull/24117#issuecomment-568614961
>>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>>>
>>>
>>> In short, this PR added an API in DSv2:
>>>
>>> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>>>
>>>
>>> So people can write some classes for *transform(col)* for partitioned
>>> column specifically.
>>>
>>> However, there are some design concerns which looked not addressed
>>> properly.
>>>
>>> Note that one of the main point is to avoid half-baked or
>>> just-work-for-now APIs. However, this looks
>>> definitely like half-completed. Therefore, I would like to propose to
>>> take one step back and revert it for now.
>>> Please see below the concerns listed.
>>>
>>> *Duplication of existing expressions*
>>> Seems like existing expressions are going to be duplicated. See below
>>> new APIs added:
>>>
>>> def years(column: String): YearsTransform = 
>>> YearsTransform(reference(column))
>>> def months(column: String): MonthsTransform = 
>>> MonthsTransform(reference(column))
>>> def days(column: String): DaysTransform = DaysTransform(reference(column))
>>> def hours(column: String): HoursTransform = 
>>> HoursTransform(reference(column))
>>> ...
>>>
>>> It looks like it requires to add a copy of our existing expressions, in
>>> the future.
>>>
>>>
>>> *Limited Extensibility*
>>> It has a clear limitation. It looks other expressions are going to be
>>> allowed together (e.g., `concat(years(col) + days(col))`);
>>> however, it looks impossible to extend with the current design. It just
>>> directly maps transformName to implementation class,
>>> and just pass arguments:
>>>
>>> transform
>>> ...
>>> | transformName=identifier
>>>   '(' argument+=transformArgument (',' argument+=transformArgument)* 
>>> ')'  #applyTransform
>>> ;
>>>
>>> It looks regular expressions are supported; however, it's not.
>>> - If we should support, the design had to consider that.
>>> - if we should not support, different syntax might have to be used
>>> instead.
>>>
>>> *Limited Compatibility Management*
>>> The name can be arbitrary. For instance, if "transform" is supported in
>>> Spark side, the name is preempted by Spark.
>>> If every the datasource supported such name, it becomes not compatible.
>>>
>>>
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


DSv2 sync notes - 11 December 2019

2019-12-19 Thread Ryan Blue
Hi everyone, here are my notes for the DSv2 sync last week. Sorry they’re
late! Feel free to add more details or corrections. Thanks!

rb

*Attendees*:

Ryan Blue
John Zhuge
Dongjoon Hyun
Joseph Torres
Kevin Yu
Russel Spitzer
Terry Kim
Wenchen Fan
Hyukjin Kwan
Jacky Lee

*Topics*:

   - Relation resolution behavior doc:
   
https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit
   - Nested schema pruning for v2 (Dongjoon)
   - TableProvider changes
   - Tasks for Spark 3.0
   - Open PRs
  - Nested schema pruning: https://github.com/apache/spark/pull/26751
  - Support FIRST and AFTER in DDL:
  https://github.com/apache/spark/pull/26817
  - Add VACUUM
   - Spark 3.1 goals (if time)

*Discussion*:

   - User-specified schema handling
  - Burak: User-specified schema should
   - Relation resolution behavior (see doc link above)
  - Ryan: Thanks to Terry for fixing table resolution:
  https://github.com/apache/spark/pull/26684, next step is to clean up
  temp views
  - Terry: the idea is to always resolve identifiers the same way and
  not to resolve a temp view in some cases but not others. If an identifier
  is a temp view and is used in a context where you can’t use a view, it
  should fail instead of finding a table.
  - Ryan: does this need to be done by 3.0?
  - *Consensus was that it should be done for 3.0*
  - Ryan: not much activity on the dev list thread for this. Do we move
  forward anyway?
  - Wenchen: okay to fix because the scope is small
  - *Consensus was to go ahead and notify the dev list about changes*
  because this is a low-risk case that does not occur often (table and temp
  view conflict)
  - Burak: cached tables are similar: for insert you get the new
  results.
  - Ryan: is that the same issue or a similar problem to fix?
  - Burak: similar, it can be done separately
  - Ryan: does this also need to be fixed by 3.0?
  - Wenchen: it is a blocker (Yes). Spark should invalidate the cached
  table after a write
  - Ryan: There’s another issue: how do we handle a permanent view with
  a name that resolves to a temp view? If incorrect, this changes
the results
  of a stored view.
  - Wenchen: This is currently broken, Spark will resolve the relation
  as a temp view. But Spark could use the analysis context to fix this.
  - Ryan: We should fix this when fixing temp views.
   - Nested schema pruning:
  - Dongjoon: Nested schema pruning was only done for Parquet and ORC
  instead of all v2 sources. Anton submitted a PR that fixes it.
  - At the time, the PR had +1s and was pending some minor discussion.
  It was merged the next day.
   - TableProvider changes:
   - Wenchen: Spark always calls 1 method to load a table. The
  implementation can do schema and partition inference in that method.
  Forcing this to be separated into other methods causes problems
in the file
  source. FileIndex is used for all these tasks.
  - Ryan: I’m not sure that existing file source code is a good enough
  justification to change the proposed API. Seems too path dependent.
  - Ryan: It is also strange to have the source of truth for schema
  information differ between code paths. Some getTable uses would pass the
  schema to the source (from metastore) with TableProvider, but some would
  instead rely on the table from getTable to provide its own
schema. This is
  confusing to implementers.
  - Burak: The default mode in DataFrameWriter is ErrorIfExists, which
  doesn’t currently work with v2 sources. Moving from Kafka to KafkaV2, for
  example, would probably break.
  - Ryan: So do we want to get extractCatalog and extractIdentifier
  into 3.0? Or is this blocked by the infer changes?
  - Burak: It would be good to have.
  - Wenchen: Schema may be inferred, or provided by Spark
  - Ryan: Sources should specify whether they accept a user-specified
  schema. But either way, the schema is still external and passed into the
  table. The main decision is whether all cases (inference included) should
  pass the schema into the table.
   - Tasks for 3.0
  - Decided to get temp view resolution fixed
  - Decided to get TableProvider changes in
  - extractCatalog/extractIdentifier are nice-to-have (but small)
  - Burak: Upgrading to v2 saveAsTable from DataFrameWriter v1 creates
  RTAS, but in Delta v1 would only overwrite the schema if requested. It
  would be nice to be able to select
  - Ryan: Standardizing behavior (replace vs truncate vs dynamic
  overwrite) is a main point of v2. Allowing sources to choose their own
  behavior is not supported in v2 so that we can guarantee consistent
  semantics across tables. Making a way for Delta to change its semantics
  doesn’t seem like a good idea

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Ryan Blue
Sounds good to me, too.
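
For context, a minimal sketch of the shape this gives implementations once
DataWriter exposes close() (the class below is hypothetical and its other
methods are left abstract):

import java.io.OutputStream
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.DataWriter

// Resource cleanup lives only in close(), which the writing task can invoke in
// a finally block, so commit() and abort() no longer need idempotent cleanup.
abstract class ExampleDataWriter(out: OutputStream) extends DataWriter[InternalRow] {
  override def close(): Unit = out.close()  // single, final cleanup point
  // write()/commit()/abort() are intentionally omitted in this sketch
}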

On Wed, Dec 11, 2019 at 1:18 AM Jungtaek Lim 
wrote:

> Thanks for the quick response, Wenchen!
>
> I'll leave this thread for early tomorrow so that someone in US timezone
> can chime in, and craft a patch if no one objects.
>
> On Wed, Dec 11, 2019 at 4:41 PM Wenchen Fan  wrote:
>
>> PartitionReader extends Closable, seems reasonable to me to do the same
>> for DataWriter.
>>
>> On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi devs,
>>>
>>> I'd like to propose to add close() on DataWriter explicitly, which is
>>> the place for resource cleanup.
>>>
>>> The rationale for the proposal is the lifecycle of
>>> DataWriter. If the scaladoc of DataWriter is correct, the lifecycle of a
>>> DataWriter instance ends at either commit() or abort(). That makes
>>> datasource implementors feel they can place resource cleanup on both
>>> sides, but abort() can be called when commit() fails, so they have to
>>> ensure they don't do double cleanup if cleanup is not idempotent.
>>>
>>> I've checked some callers to see whether they can apply
>>> "try-catch-finally" to ensure close() is called at the end of the lifecycle of a
>>> DataWriter, and it looks like they can, but I might be missing something.
>>>
>>> What do you think? It would bring a backward-incompatible change, but
>>> given that the interface is marked as Evolving and we're making backward
>>> incompatible changes in Spark 3.0, I feel it may not matter.
>>>
>>> Would love to hear your thoughts.
>>>
>>> Thanks in advance,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Next DSv2 sync date

2019-12-09 Thread Ryan Blue
Actually, my conflict was cancelled so I'll send out the usual invite for
Wednesday. Sorry for the noise.

On Sun, Dec 8, 2019 at 3:15 PM Ryan Blue  wrote:

> Hi everyone,
>
> I have a conflict with the normal DSv2 sync time this Wednesday and I'd
> like to attend to talk about the TableProvider API.
>
> Would it work for everyone to have the sync at 6PM PST on Tuesday, 10
> December instead? I could also make it at the normal time on Thursday.
>
> Thanks,
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix


Next DSv2 sync date

2019-12-08 Thread Ryan Blue
Hi everyone,

I have a conflict with the normal DSv2 sync time this Wednesday and I'd
like to attend to talk about the TableProvider API.

Would it work for everyone to have the sync at 6PM PST on Tuesday, 10
December instead? I could also make it at the normal time on Thursday.

Thanks,

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-05 Thread Ryan Blue
+1 for the proposal. The current behavior is confusing.

We also came up with another case that we should consider while
implementing a ViewCatalog: an unresolved relation in a permanent view
(from a view catalog) should never resolve a temporary table. If I have a
view `pview` defined as `select * from t1` with database `db`, then `t1`
should always resolve to `db.t1` and never a temp view `t1`. If it resolves
to the temp view, then temp views can unexpectedly change the behavior of
stored views.
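
In SQL terms, the scenario looks like this (a sketch for spark-shell, assuming
db.t1 already exists):

spark.sql("USE db")
spark.sql("CREATE VIEW pview AS SELECT * FROM t1")       // t1 resolves to db.t1 at definition time
spark.sql("CREATE TEMPORARY VIEW t1 AS SELECT 1 AS id")  // created later, possibly by another user
spark.sql("SELECT * FROM db.pview")                      // must keep reading db.t1, never the temp view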

On Wed, Dec 4, 2019 at 7:02 PM Wenchen Fan  wrote:

> +1, I think it's good for both end-users and Spark developers:
> * for end-users, when they lookup a table, they don't need to care which
> command triggers it, as the behavior is consistent in all the places.
> * for Spark developers, we may simplify the code quite a bit. For now we
> have two code paths to lookup tables: one for SELECT/INSERT and one for
> other commands.
>
> Thanks,
> Wenchen
>
> On Mon, Dec 2, 2019 at 9:12 AM Terry Kim  wrote:
>
>> Hi all,
>>
>> As discussed in SPARK-29900, Spark currently has two different relation
>> resolution behaviors:
>>
>>1. Look up temp view first, then table/persistent view
>>2. Look up table/persistent view
>>
>> The first behavior is used in SELECT, INSERT and a few commands that
>> support temp views such as DESCRIBE TABLE, etc. The second behavior is used
>> in most commands. Thus, it is hard to predict which relation resolution
>> rule is being applied for a given command.
>>
>> I want to propose a consistent relation resolution behavior in which temp
>> views are always looked up first before table/persistent view, as
>> described more in detail in this doc: consistent relation resolution
>> proposal
>> <https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing>
>> .
>>
>> Note that this proposal is a breaking change, but the impact should be
>> minimal since this applies only when there are temp views and tables with
>> the same name.
>>
>> Any feedback will be appreciated.
>>
>> I also want to thank Wenchen Fan, Ryan Blue, Burak Yavuz, and Dongjoon
>> Hyun for guidance and suggestion.
>>
>> Regards,
>> Terry
>>
>>
>> <https://issues.apache.org/jira/browse/SPARK-29900>
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Ryan Blue
Just to clarify, I don't think that Parquet 1.10.1 to 1.11.0 is a
runtime-incompatible change. The example mixed 1.11.0 and 1.10.1 in the
same execution.

Michael, please be more careful about announcing compatibility problems in
other communities. If you've observed problems, let's find out the root
cause first.

rb

On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:

> Hello,
>
> Avro 1.8.2 to 1.9.1 is a binary incompatible update, and it appears that
> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see thread on
> dev@parquet
> <https://mail-archives.apache.org/mod_mbox/parquet-dev/201911.mbox/%3c8357699c-9295-4eb0-a39e-b3538d717...@gmail.com%3E>
> ).
>
> Might there be any desire to cut a Spark 2.4.5 release so that users can
> pick up these changes independently of all the other changes in Spark 3.0?
>
> Thank you in advance,
>
>michael
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread Ryan Blue
f. If there is substantial overlap between the
>> SortShuffleManager and other implementations, then the storage details can
>> be abstracted at the appropriate level. (SPARK-25299 does not currently
>> change this.)
>>
>>
>> Do not require MapStatus to include blockmanager IDs where they are not
>> relevant. This is captured by ShuffleBlockInfo
>> <https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit#heading=h.imi27prnziyj>
>> including an optional BlockManagerId in SPARK-25299. However, this
>> change should be lifted to the MapStatus level so that it applies to all
>> ShuffleManagers. Alternatively, use a more general data-location
>> abstraction than BlockManagerId. This gives the shuffle manager more
>> flexibility and the scheduler more information with respect to data
>> residence.
>> Serialization
>>
>> Allow serializers to be used more flexibly and efficiently. For example,
>> have serializers support writing an arbitrary number of objects into an
>> existing OutputStream or ByteBuffer. This enables objects to be serialized
>> to direct buffers where doing so makes sense. More importantly, it allows
>> arbitrary metadata/framing data to be wrapped around individual objects
>> cheaply. Right now, that’s only possible at the stream level. (There are
>> hacks around this, but this would enable more idiomatic use in efficient
>> shuffle implementations.)
>>
>>
>> Have serializers indicate whether they are deterministic. This provides
>> much of the value of a shuffle service because it means that reducers do
>> not need to spill to disk when reading/merging/combining inputs--the data
>> can be grouped by the service, even without the service understanding data
>> types or byte representations. Alternative (less preferable since it would
>> break Java serialization, for example): require all serializers to be
>> deterministic.
>>
>>
>>
>> --
>>
>> - Ben
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: DSv2 reader lifecycle

2019-11-06 Thread Ryan Blue
Hi Andrew,

This is expected behavior for DSv2 in 2.4. A separate reader is configured
for each operation because the configuration will change. A count, for
example, doesn't need to project any columns, but a count distinct will.
Similarly, if your read has different filters we need to apply those to a
separate reader for each scan.

The newer API that we are releasing in Spark 3.0 addresses the concern that
each reader is independent by using Catalog and Table interfaces. In the
new version, Spark will load a table by name from a persistent catalog
(loaded once) and use the table to create a reader for each operation. That
way, you can load common metadata in the table, cache the table, and pass
its info to readers as they are created.

rb
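
As a stopgap on 2.4, one option is to memoize the expensive metadata outside
the reader so that repeated createReader calls reuse it. A rough sketch of
the idea; the FileMetadata type, the loadMetadata helper, and the path-keyed
cache are assumptions for illustration, not part of the Spark API:

    import java.util.concurrent.ConcurrentHashMap
    import java.util.function.{Function => JFunction}

    // Placeholder for whatever is expensive to deserialize from the format.
    case class FileMetadata(columns: Seq[String])

    object MetadataCache {
      private val cache = new ConcurrentHashMap[String, FileMetadata]()

      // Hypothetical loader; a real source would parse file headers here.
      private def loadMetadata(path: String): FileMetadata =
        FileMetadata(Seq("col1", "col2"))

      // One load per path, no matter how many readers Spark constructs for
      // the same dataframe.
      def get(path: String): FileMetadata =
        cache.computeIfAbsent(path, new JFunction[String, FileMetadata] {
          override def apply(p: String): FileMetadata = loadMetadata(p)
        })
    }

A createReader implementation can then call MetadataCache.get(path) on every
invocation without re-paying the deserialization cost; in the 3.0 API the
natural home for that cached state is the Table instance loaded from the
catalog.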

On Tue, Nov 5, 2019 at 4:58 PM Andrew Melo  wrote:

> Hello,
>
> During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears that
> our DataSourceReader is being instantiated multiple times for the same
> dataframe. For example, the following snippet
>
> Dataset df = spark
> .read()
> .format("edu.vanderbilt.accre.laurelin.Root")
> .option("tree",  "Events")
> .load("testdata/pristine/2018nanoaod1june2019.root");
>
> Constructs edu.vanderbilt.accre.laurelin.Root twice and then calls
> createReader once (as an aside, this seems like a lot for 1000 columns?
> "CodeGenerator: Code generated in 8162.847517 ms")
>
> but then running operations on that dataframe (e.g. df.count()) calls
> createReader for each call, instead of holding the existing
> DataSourceReader.
>
> Is that the expected behavior? Because of the file format, it's quite
> expensive to deserialize all the various metadata, so I was holding the
> deserialized version in the DataSourceReader, but if Spark is repeatedly
> constructing new ones, then that doesn't help. If this is the expected
> behavior, how should I handle this as a consumer of the API?
>
> Thanks!
> Andrew
>


-- 
Ryan Blue
Software Engineer
Netflix


DSv2 sync notes - 30 October 2019

2019-11-01 Thread Ryan Blue
*Attendees*:

Ryan Blue
Terry Kim
Wenchen Fan
Jose Torres
Jacky Lee
Gengliang Wang

*Topics*:

   - DROP NAMESPACE cascade behavior
   - 3.0 tasks
   - TableProvider API changes
   - V1 and V2 table resolution rules
   - Separate logical and physical write (for streaming)
   - Bucketing support (if time)
   - Open PRs

*Discussion*:

   - DROP NAMESPACE cascade
  - Terry: How should the cascade option be handled?
  - Ryan: The API currently requires failing when the namespace is
  non-empty; the intent is for Spark to handle the complexity of recursive
  deletes
  - Wenchen: That will be slow because Spark has to list and issue
  individual delete calls.
  - Ryan: What about changing this so that DROP is always a recursive
  drop? Then Spark can check all implemented features (views for
ViewCatalog,
  tables for TableCatalog) and we don’t need to add more calls and args.
   - Consensus was to update dropNamespace so that it is always cascading,
   so implementations can speed up the operation. Spark will check whether a
   namespace is empty and will not issue the call if it is non-empty and the
   query was not cascading.
   - Remaining 3.0 tasks:
  - Add inferSchema and inferPartitioning to TableProvider (#26297)
  - Add catalog and identifier methods so that DataFrameWriter can
  support ErrorIfExists and Ignore modes
   - TableProvider changes:
  - Wenchen: tables need both schema and partitioning. Sometimes these
  are provided but not always. Currently, they are inferred if not
provided,
  but this is implicit based on whether they are passed.
   - Wenchen: A better API is to add inferSchema and inferPartitioning
   that are separate from getTable, so they are always explicitly passed to
   getTable (see the sketch after these notes).
  - Wenchen: the only problem is on the write path, where inference is
  not currently done for path-based tables. The PR has a special
case to skip
  inference in this case.
  - Ryan: Sounds okay, will review soon.
  - Ryan: Why is inference so expensive?
  - Wenchen: No validation on write means extra validation is needed to
  read. All file schemas should be used to ensure compatibility.
Partitioning
  is similar: more examples are needed to determine partition column types.
   - Resolution rules
  - Ryan: we found that the v1 and v2 rules are order dependent.
  Wenchen has a PR, but it rewrites the v1 ResolveRelations rule. That’s
  concerning because we don’t want to risk breaking v1 in 3.0. So
we need to
  find a work-around
  - Wenchen: Burak suggested a work-around that should be a good
  approach
  - Ryan: Agreed. And in the long term, I don’t think we want to mix
  view and table resolution. View resolution is complicated because it
  requires context (e.g., current db). But it shouldn’t be necessary to
  resolve tables at the same time. Identifiers can be rewritten to avoid
  this. We should also consider moving view resolution into an
earlier batch.
  In that case, view resolution would happen in a fixed-point batch and it
  wouldn’t need the custom recursive code.
  - Ryan: Can permanent views resolve temporary views? If not, we can
  move temporary views sooner, which would help simplify the v2 resolution
  rules.
   - Separating logical and physical writes
  - Wenchen: there is a use case to add physical information to
  streaming writes, like parallelism. The way streaming is
written, it makes
  sense to separate writes into logical and physical stages, like the read
  side with Scan and Batch.
  - Ryan: So this would create separate Write and Batch objects? Would
  this move epoch ID to the creation of a batch write?
  - Wenchen: maybe. Will write up a design doc. Goal is to get this
  into Spark 3.0 if possible
  - Ryan: Okay, but I think TableProvider is still high priority for
  the 3.0 work
  - Wenchen: Agreed.
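
A sketch of the TableProvider shape discussed above: inference becomes two
explicit steps whose results are always passed to getTable. The method
signatures, the Map-based options, and the AnyRef return type are
simplifications for illustration, not the final Spark interface:

    import org.apache.spark.sql.types.StructType

    trait TableProviderSketch {
      // Called only when the user did not supply a schema / partitioning.
      def inferSchema(options: Map[String, String]): StructType
      def inferPartitioning(options: Map[String, String]): Array[String]

      // Always receives an explicit schema and partitioning, whether they
      // were user-supplied or inferred above.
      def getTable(
          schema: StructType,
          partitioning: Array[String],
          options: Map[String, String]): AnyRef // the resolved Table
    }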

-- 
Ryan Blue
Software Engineer
Netflix


Cancel DSv2 sync this week

2019-10-15 Thread Ryan Blue
Hi everyone,

I can't make it to the DSv2 sync tomorrow, so let's skip it. If anyone
would prefer to have one and is willing to take notes, I can send out the
invite. Just let me know, otherwise let's consider it cancelled.

Thanks,

rb

-- 
Ryan Blue
Software Engineer
Netflix


DataSourceV2 sync notes - 2 October 2019

2019-10-10 Thread Ryan Blue
Here are my notes from last week's DSv2 sync.

*Attendees*:

Ryan Blue
Terry Kim
Wenchen Fan

*Topics*:

   - SchemaPruning only supports Parquet and ORC?
   - Out of order optimizer rules
   - 3.0 work
  - Rename session catalog to spark_catalog
  - Finish TableProvider update to avoid another API change: pass all
  table config from metastore
  - Catalog behavior fix:
  https://issues.apache.org/jira/browse/SPARK-29014
  - Stats push-down optimization:
  https://github.com/apache/spark/pull/25955
  - DataFrameWriter v1/v2 compatibility progress
   - Open PRs
  - Update identifier resolution and table resolution:
  https://github.com/apache/spark/pull/25747
  - Expose SerializableConfiguration:
  https://github.com/apache/spark/pull/26005
  - Early DSv2 pushdown: https://github.com/apache/spark/pull/25955

*Discussion*:

   - Update identifier and table resolution
  - Wenchen: Will not handle SPARK-29014, it is a pure refactor
  - Ryan: I think this should separate the v2 rules from the v1
  fallback, to keep table and identifier resolution separate. The only time
  that table resolution needs to be done at the same time is for
v1 fallback.
  - This was merged last week
   - Update to use spark_catalog
  - Wenchen: this will be a separate PR.
  - Now open: https://github.com/apache/spark/pull/26071
   - Early DSv2 pushdown
  - Ryan: this depends on fixing a few more tests. To validate there
  are no calls to computeStats with the DSv2 relation, I’ve temporarily
  removed the method. Other than a few remaining test failures
where the old
  relation was expected, it looks like there are no uses of computeStats
  before early pushdown in the optimizer.
  - Wenchen: agreed that the batch was in the correct place in the
  optimizer
  - Ryan: once tests are passing, will add the computeStats
  implementation back with Utils.isTesting to fail during testing
when called
  before early pushdown, but will not fail at runtime
   - Wenchen: when using v2, there is no way to configure custom options
   for a JDBC table. For v1, the table was created and stored in the session
   catalog, at which point Spark-specific properties like parallelism could be
   stored. In v2, the catalog is the source of truth, so tables don’t get
   created in the same way. Options are only passed in a create statement.
  - Ryan: this could be fixed by allowing users to pass options as
  table properties. We mix the two today, but if we used a prefix for table
   properties, “options.”, then you could use SET TBLPROPERTIES to get around
   this. That’s also better for compatibility. I’ll open a PR for this (see
   the example after these notes).
  - Ryan: this could also be solved by adding an OPTIONS clause or hint
  to SELECT
   - Wenchen: There are commands without v2 statements. We should add v2
   statements to reject non-v1 uses.
  - Ryan: Doesn’t the parser only parse up to 2 identifiers for these?
  That would handle the majority of cases
   - Wenchen: Yes, but there is still a problem for one-part identifiers in
   v2 catalogs, i.e. catalog.table: commands that don’t support v2 will
   treat catalog.table as database.table in the v1 catalog.
  - Ryan: Sounds like a good plan to update the parser and add
  statements for these. Do we have a list of commands to update?
  - Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION, etc.
  Will open an umbrella JIRA with a list.
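
An example of the “options.” table-property convention mentioned above. The
catalog, table, and the fetchsize option are made-up names, and the prefix is
only the convention under discussion, not an existing Spark feature:

    import scala.collection.JavaConverters._

    // Store a read/write option on the table itself so it survives in the
    // catalog instead of being passed on every read.
    spark.sql("""
      ALTER TABLE testcat.db.events
      SET TBLPROPERTIES ('options.fetchsize' = '1000')
    """)

    // How a source might strip the prefix when turning table properties back
    // into reader options (helper name and types are assumptions).
    def optionsFromProperties(
        props: java.util.Map[String, String]): Map[String, String] =
      props.asScala.collect {
        case (k, v) if k.startsWith("options.") =>
          k.stripPrefix("options.") -> v
      }.toMap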

-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Ryan Blue
+1

Thanks for fixing this!
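
For reference, a small illustration of how the three policies described in
the quoted proposal below differ on the same inserts, run in a Spark shell
session. The table name and values are made up; the outcomes in the comments
follow the descriptions in the proposal:

    spark.sql("CREATE TABLE target (b BYTE) USING parquet")

    // Legacy (Spark 2.x behavior): the cast is allowed and the out-of-range
    // value 257 wraps to 1.
    spark.sql("SET spark.sql.storeAssignmentPolicy=LEGACY")
    spark.sql("INSERT INTO target VALUES (257)")

    // ANSI (the proposed default): unreasonable casts such as string -> int
    // are rejected, and an out-of-range value raises a runtime exception
    // instead of being silently truncated.
    spark.sql("SET spark.sql.storeAssignmentPolicy=ANSI")
    spark.sql("INSERT INTO target VALUES (257)")  // fails at runtime

    // Strict: even potentially lossy casts such as double -> int are
    // rejected.
    spark.sql("SET spark.sql.storeAssignmentPolicy=STRICT")
    spark.sql("INSERT INTO target VALUES (CAST(1.5 AS DOUBLE))")  // rejected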

On Thu, Oct 10, 2019 at 6:30 AM Xiao Li  wrote:

> +1
>
> On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:
>
>> +1 (binding)
>>
>> On Thu, Oct 10, 2019 at 5:11 PM, Takeshi Yamamuro  wrote:
>>
>>> Thanks for the great work, Gengliang!
>>>
>>> +1 for that.
>>> As I said before, the behaviour is pretty common in DBMSs, so the change
>>> helps for DMBS users.
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
>>> gengliang.w...@databricks.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'd like to call for a new vote on SPARK-28885
>>>> <https://issues.apache.org/jira/browse/SPARK-28885> "Follow ANSI store
>>>> assignment rules in table insertion by default" after revising the ANSI
>>>> store assignment policy(SPARK-29326
>>>> <https://issues.apache.org/jira/browse/SPARK-29326>).
>>>> When inserting a value into a column with a different data type,
>>>> Spark performs type coercion. Currently, we support 3 policies for the
>>>> store assignment rules: ANSI, legacy and strict, which can be set via the
>>>> option "spark.sql.storeAssignmentPolicy":
>>>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>>>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>>>> certain unreasonable type conversions such as converting `string` to `int`
>>>> and `double` to `boolean`. It will throw a runtime exception if the value
>>>> is out of range (overflow).
>>>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>>>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>>>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>>>> for compatibility with Hive. When inserting an out-of-range value to an
>>>> integral field, the low-order bits of the value are inserted (the same as
>>>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>>>> field of Byte type, the result is 1.
>>>> 3. Strict: Spark doesn't allow any possible precision loss or data
>>>> truncation in store assignment, e.g., converting either `double` to `int`
>>>> or `decimal` to `double` is not allowed. The rules were originally for the
>>>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy by
>>>> default.
>>>>
>>>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>>>> and V2 in Spark 3.0.
>>>>
>>>> This vote is open until Friday (Oct. 11).
>>>>
>>>> [ ] +1: Accept the proposal
>>>> [ ] +0
>>>> [ ] -1: I don't think this is a good idea because ...
>>>>
>>>> Thank you!
>>>>
>>>> Gengliang
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
Thanks for the pointers, but what I'm looking for is information about the
design of this implementation, like what requires this to be in spark-sql
instead of spark-catalyst.

Even a high-level description, like what the optimizer rules are and what
they do would be great. Was there one written up internally that you could
share?
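
For readers following along, the behavior described in the quoted reply below
boils down to queries like the following (the table names and partition
column are made up; `sales` is assumed to be partitioned by `day`):

    spark.sql("""
      SELECT s.item, s.amount
      FROM sales s
      JOIN dates d ON s.day = d.day
      WHERE d.is_holiday = true
    """).show()
    // With dynamic partition pruning, the result of the filtered `dates`
    // side (typically broadcast) supplies the set of matching `day` values
    // at runtime, so only those partitions of `sales` are scanned instead of
    // the whole table.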

On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue 
wrote:

> > It lists 3 cases for how a filter is built, but nothing about the
> overall approach or design that helps when trying to find out where it
> should be placed in the optimizer rules.
>
> The overall idea/design of DPP can be simply put as using the result of
> one side of the join to prune partitions of a scan on the other side. The
> optimal situation is when the join is a broadcast join and the table being
> partition-pruned is on the probe side. In that case, by the time the probe
> side starts, the filter will already have the results available and ready
> for reuse.
>
> Regarding the place in the optimizer rules, it's preferred to happen late
> in the optimization, and definitely after join reorder.
>
>
> Thanks,
> Maryann
>
> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin  wrote:
>
>> Whoever created the JIRA years ago didn't describe dpp correctly, but the
>> linked jira in Hive was correct (which unfortunately is much more terse
>> than any of the patches we have in Spark
>> https://issues.apache.org/jira/browse/HIVE-9152). Henry R's description
>> was also correct.
>>
>>
>>
>>
>>
>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue 
>> wrote:
>>
>>> Where can I find a design doc for dynamic partition pruning that
>>> explains how it works?
>>>
>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>> the implementation's approach. And the PR description also doesn't have
>>> much information. It lists 3 cases for how a filter is built, but
>>> nothing about the overall approach or design that helps when trying to find
>>> out where it should be placed in the optimizer rules. It also isn't clear
>>> why this couldn't be part of spark-catalyst.
>>>
>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan  wrote:
>>>
>>>> dynamic partition pruning rule generates "hidden" filters that will be
>>>> converted to real predicates at runtime, so it doesn't matter where we run
>>>> the rule.
>>>>
>>>> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's
>>>> better to run it before join reorder.
>>>>
>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue 
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I have been working on a PR that moves filter and projection pushdown
>>>>> into the optimizer for DSv2, instead of when converting to physical plan.
>>>>> This will make DSv2 work with optimizer rules that depend on stats, like
>>>>> join reordering.
>>>>>
>>>>> While adding the optimizer rule, I found that some rules appear to be
>>>>> out of order. For example, PruneFileSourcePartitions that handles
>>>>> filter pushdown for v1 scans is in SparkOptimizer (spark-sql) in a
>>>>> batch that will run after all of the batches in Optimizer
>>>>> (spark-catalyst) including CostBasedJoinReorder.
>>>>>
>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>>> *after* both the cost-based join reordering and the v1 partition
>>>>> pruning rule. I’m not sure why this should run after join reordering and
>>>>> partition pruning, since it seems to me like additional filters would be
>>>>> good to have before those rules run.
>>>>>
>>>>> It looks like this might just be that the rules were written in the
>>>>> spark-sql module instead of in catalyst. That makes some sense for the v1
>>>>> pushdown, which is altering physical plan details (FileIndex) that
>>>>> have leaked into the logical plan. I’m not sure why the dynamic partition
>>>>> pruning rules aren’t in catalyst or why they run after the v1 predicate
>>>>> pushdown.
>>>>>
>>>>> Can someone more familiar with these rules clarify why they appear to
>>>>> be out of order?
>>>>>
>>>>> Assuming that this is an accident, I think it’s something that should
>>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning
>>>>> may still need to be addressed.
>>>>>
>>>>> rb
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
Where can I find a design doc for dynamic partition pruning that explains
how it works?

The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
pruning (as pointed out by Henry R.) and doesn't have any comments about
the implementation's approach. And the PR description also doesn't have
much information. It lists 3 cases for how a filter is built, but
nothing about the overall approach or design that helps when trying to find
out where it should be placed in the optimizer rules. It also isn't clear
why this couldn't be part of spark-catalyst.

On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan  wrote:

> dynamic partition pruning rule generates "hidden" filters that will be
> converted to real predicates at runtime, so it doesn't matter where we run
> the rule.
>
> For PruneFileSourcePartitions, I'm not quite sure. Seems to me it's better
> to run it before join reorder.
>
> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> I have been working on a PR that moves filter and projection pushdown
>> into the optimizer for DSv2, instead of when converting to physical plan.
>> This will make DSv2 work with optimizer rules that depend on stats, like
>> join reordering.
>>
>> While adding the optimizer rule, I found that some rules appear to be out
>> of order. For example, PruneFileSourcePartitions that handles filter
>> pushdown for v1 scans is in SparkOptimizer (spark-sql) in a batch that
>> will run after all of the batches in Optimizer (spark-catalyst)
>> including CostBasedJoinReorder.
>>
>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>> *after* both the cost-based join reordering and the v1 partition pruning
>> rule. I’m not sure why this should run after join reordering and partition
>> pruning, since it seems to me like additional filters would be good to have
>> before those rules run.
>>
>> It looks like this might just be that the rules were written in the
>> spark-sql module instead of in catalyst. That makes some sense for the v1
>> pushdown, which is altering physical plan details (FileIndex) that have
>> leaked into the logical plan. I’m not sure why the dynamic partition
>> pruning rules aren’t in catalyst or why they run after the v1 predicate
>> pushdown.
>>
>> Can someone more familiar with these rules clarify why they appear to be
>> out of order?
>>
>> Assuming that this is an accident, I think it’s something that should be
>> fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning may
>> still need to be addressed.
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


[DISCUSS] Out of order optimizer rules?

2019-09-28 Thread Ryan Blue
Hi everyone,

I have been working on a PR that moves filter and projection pushdown into
the optimizer for DSv2, instead of when converting to physical plan. This
will make DSv2 work with optimizer rules that depend on stats, like join
reordering.

While adding the optimizer rule, I found that some rules appear to be out
of order. For example, PruneFileSourcePartitions that handles filter
pushdown for v1 scans is in SparkOptimizer (spark-sql) in a batch that will
run after all of the batches in Optimizer (spark-catalyst) including
CostBasedJoinReorder.

SparkOptimizer also adds the new “dynamic partition pruning” rules *after*
both the cost-based join reordering and the v1 partition pruning rule. I’m
not sure why this should run after join reordering and partition pruning,
since it seems to me like additional filters would be good to have before
those rules run.

It looks like this might just be that the rules were written in the
spark-sql module instead of in catalyst. That makes some sense for the v1
pushdown, which is altering physical plan details (FileIndex) that have
leaked into the logical plan. I’m not sure why the dynamic partition
pruning rules aren’t in catalyst or why they run after the v1 predicate
pushdown.

Can someone more familiar with these rules clarify why they appear to be
out of order?

Assuming that this is an accident, I think it’s something that should be
fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning may
still need to be addressed.

rb
-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
> That's not a new requirement, that's an "implicit" requirement via
semantic versioning.

The expectation is that the DSv2 API will change in minor versions in the
2.x line. The API is marked with the Experimental API annotation to signal
that it can change, and it has been changing.

A requirement to not change this API for a 2.5 release is a new
requirement. I'm fine with that if that's what everyone wants. Like I said,
if we want to add a requirement to not change this API then we shouldn't
release the 2.5 that I'm proposing.

On Tue, Sep 24, 2019 at 2:51 PM Jungtaek Lim  wrote:

> >> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
>
> > This has not been a requirement for DSv2 development so far. If this is
> a new requirement, then we should not do a 2.5 release.
>
> My 2 cents, target version of new DSv2 has been only 3.0 so we don't ever
> have a chance to think about such requirement - that's why there's no
> restriction on breaking compatibility on codebase. That's not a new
> requirement, that's an "implicit" requirement via semantic versioning. I
> agree that some of APIs have been changed between Spark 2.x versions, but I
> guess the changes in "new" DSv2 would be bigger than summation of changes
> on "old" DSv2 which has been introduced across multiple minor versions.
>
> Suppose we're developers of Spark ecosystem maintaining custom data source
> (forget about developing Spark): I would get some official announcement on
> next minor version, and I want to try it out quickly to see my stuff still
> supports new version. When I change the dependency version everything will
> break. My hopeful expectation would be no issue while upgrading but turns
> out it's not, and even it requires new learning (not only fixing
> compilation failures). It would just make me giving up support Spark 2.5 or
> at least I won't follow up such change quickly. IMHO 3.0-techpreview has
> advantage here (assuming we provide maven artifacts as well as official
> announcement), as it can give us expectation that there're bunch of changes
> given it's a new major version. It also provides bunch of time to try
> adopting it before the version is officially released.
>
>
> On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue  wrote:
>
>> From those questions, I can see that there is significant confusion about
>> what I'm proposing, so let me try to clear it up.
>>
>> > 1. Is DSv2 stable in `master`?
>>
>> DSv2 has reached a stable API that is capable of supporting all of the
>> features we intend to deliver for Spark 3.0. The proposal is to backport
>> the same API and features for Spark 2.5.
>>
>> I am not saying that this API won't change after 3.0. Notably, Reynold
>> wants to change the use of InternalRow. But, these changes are after 3.0
>> and don't affect the compatibility I'm proposing, between the 2.5 and 3.0
>> releases. I also doubt that breaking changes would happen by 3.1.
>>
>> > 2. If then, what subset of DSv2 patches does Ryan is suggesting
>> backporting?
>>
>> I am proposing backporting what we intend to deliver for 3.0: the API
>> currently in master, SQL support, and multi-catalog support.
>>
>> > 3. How much those backporting DSv2 patches looks differently in
>> `branch-2.4`?
>>
>> DSv2 is mostly an addition located in the `connector` package. It also
>> changes some parts of the SQL parser and adds parsed plans, as well as new
>> rules to convert from parsed plans. This is not an invasive change because
>> we kept most of DSv2 separate. DSv2 should be nearly identical between the
>> two branches.
>>
>> > 4. What does he mean by `without breaking changes? Is it technically
>> feasible?
>>
>> DSv2 is marked unstable in the 2.x line and changes between releases. The
>> API changed between 2.3 and 2.4, so this would be no different. But, we
>> would keep the API the same between 2.5 and 3.0 to assist migration.
>>
>> This is technically feasible because what we are planning to deliver for
>> 3.0 is nearly ready, and the API has not needed to change recently.
>>
>> > Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.
>>
>> This has not been a requirement for DSv2 development so far. If this is a
>> new requirement, then we should not do a 2.5 release.
>>
>> > 5. How long does it take? Is it possible before 3.0.0-preview? Who will
>> work on that backporting?
>>
>> As I said, I'm already going to do this work, so I'm offering to release
>> it to the community. I don't know how long it will take, but this work and
>> 3.0-preview are not mutually exclusi

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
From those questions, I can see that there is significant confusion about
what I'm proposing, so let me try to clear it up.

> 1. Is DSv2 stable in `master`?

DSv2 has reached a stable API that is capable of supporting all of the
features we intend to deliver for Spark 3.0. The proposal is to backport
the same API and features for Spark 2.5.

I am not saying that this API won't change after 3.0. Notably, Reynold
wants to change the use of InternalRow. But, these changes are after 3.0
and don't affect the compatibility I'm proposing, between the 2.5 and 3.0
releases. I also doubt that breaking changes would happen by 3.1.

> 2. If then, what subset of DSv2 patches does Ryan is suggesting
backporting?

I am proposing backporting what we intend to deliver for 3.0: the API
currently in master, SQL support, and multi-catalog support.

> 3. How much those backporting DSv2 patches looks differently in
`branch-2.4`?

DSv2 is mostly an addition located in the `connector` package. It also
changes some parts of the SQL parser and adds parsed plans, as well as new
rules to convert from parsed plans. This is not an invasive change because
we kept most of DSv2 separate. DSv2 should be nearly identical between the
two branches.

> 4. What does he mean by `without breaking changes? Is it technically
feasible?

DSv2 is marked unstable in the 2.x line and changes between releases. The
API changed between 2.3 and 2.4, so this would be no different. But, we
would keep the API the same between 2.5 and 3.0 to assist migration.

This is technically feasible because what we are planning to deliver for
3.0 is nearly ready, and the API has not needed to change recently.

> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible.

This has not been a requirement for DSv2 development so far. If this is a
new requirement, then we should not do a 2.5 release.

> 5. How long does it take? Is it possible before 3.0.0-preview? Who will
work on that backporting?

As I said, I'm already going to do this work, so I'm offering to release it
to the community. I don't know how long it will take, but this work and
3.0-preview are not mutually exclusive.

> 6. Is this meaningful if 2.5 and 3.1 become different again too soon (in
2020 Summer)?

It is useful to me, so I assume it is useful to others.

I also think it is unlikely that 3.1 will need to make API changes to DSv2.
There may be some bugs found, but I don't think we will break API
compatibility so quickly. Most of the changes to the API will require only
additions.

> If you have a working branch, please share with us.

I don't have a branch to share.


On Mon, Sep 23, 2019 at 6:47 PM Dongjoon Hyun 
wrote:

> Hi, Ryan.
>
> This thread has many replied as you see. That is the evidence that the
> community is interested in your suggestion a lot.
>
> > I'm offering to help build a stable release without breaking changes.
> But if there is no community interest in it, I'm happy to drop this.
>
> In this thread, the root cause of the disagreement is due to the lack of
> supporting evidence for your claims.
>
> 1. Is DSv2 stable in `master`?
> 2. If then, what subset of DSv2 patches does Ryan is suggesting
> backporting?
> 3. How much those backporting DSv2 patches looks differently in
> `branch-2.4`?
> 4. What does he mean by `without breaking changes? Is it technically
> feasible?
> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. (Not between
> 2.5.x DSv2 and 3.0.0 DSv2)
> 5. How long does it take? Is it possible before 3.0.0-preview? Who will
> work on that backporting?
> 6. Is this meaningful if 2.5 and 3.1 become different again too soon (in
> 2020 Summer)?
>
> We are SW engineers.
> If you have a working branch, please share with us.
> It will help us understand your suggestion and this discussion.
> We can help you verify that branch achieves your goal.
> The branch is tested already, isn't it?
>
> Bests,
> Dongjoon.
>
>
>
>
> On Mon, Sep 23, 2019 at 10:44 AM Holden Karau 
> wrote:
>
>> I would personally love to see us provide a gentle migration path to
>> Spark 3 especially if much of the work is already going to happen anyways.
>>
>> Maybe giving it a different name (eg something like
>> Spark-2-to-3-transitional) would make it more clear about its intended
>> purpose and encourage folks to move to 3 when they can?
>>
>> On Mon, Sep 23, 2019 at 9:17 AM Ryan Blue 
>> wrote:
>>
>>> My understanding is that 3.0-preview is not going to be a
>>> production-ready release. For those of us that have been using backports of
>>> DSv2 in production, that doesn't help.
>>>
>>> It also doesn't help as a stepping stone because users would need to
>>> handle all of the incompatible changes in 3.0. Using 3.0-previ

Re: [DISCUSS] Spark 2.5 release

2019-09-23 Thread Ryan Blue
My understanding is that 3.0-preview is not going to be a production-ready
release. For those of us that have been using backports of DSv2 in
production, that doesn't help.

It also doesn't help as a stepping stone because users would need to handle
all of the incompatible changes in 3.0. Using 3.0-preview would be an
unstable release with breaking changes instead of a stable release without
the breaking changes.

I'm offering to help build a stable release without breaking changes. But
if there is no community interest in it, I'm happy to drop this.

On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon  wrote:

> +1 for Matei's as well.
>
> On Sun, 22 Sep 2019, 14:59 Marco Gaido,  wrote:
>
>> I agree with Matei too.
>>
>> Thanks,
>> Marco
>>
>> On Sun, Sep 22, 2019 at 03:44 Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> +1 for Matei's suggestion!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia 
>>> wrote:
>>>
>>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>>> sources, can we recommend the 3.0-preview release for this? That would get
>>>> people shifting to 3.0 faster, which is probably better overall compared to
>>>> maintaining two major versions. There’s not that much else changing in 3.0
>>>> if you already want to update your Java version.
>>>>
>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue 
>>>> wrote:
>>>>
>>>> > If you insist we shouldn't change the unstable temporary API in 3.x .
>>>> . .
>>>>
>>>> Not what I'm saying at all. I said we should carefully consider whether
>>>> a breaking change is the right decision in the 3.x line.
>>>>
>>>> All I'm suggesting is that we can make a 2.5 release with the feature
>>>> and an API that is the same as the one in 3.0.
>>>>
>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>
>>>> I am planning to do this so we can use DSv2 before 3.0 is released.
>>>> Then we can have a source implementation that works in both 2.x and 3.0 to
>>>> make the transition easier. Since I'm already doing the work, I'm offering
>>>> to share it with the community.
>>>>
>>>>
>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin 
>>>> wrote:
>>>>
>>>>> Because for example we'd need to move the location of InternalRow,
>>>>> breaking the package name. If you insist we shouldn't change the unstable
>>>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>>>> different from my understanding of the situation when you exposed it, then
>>>>> I'd say we should gate 3.0 on having a stable row interface.
>>>>>
>>>>> I also don't get this backporting a giant feature to 2.x line ... as
>>>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing 
>>>>> this?
>>>>> Why not abandoning 3.0 entirely and backport all the features to 2.x?
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue  wrote:
>>>>>
>>>>>> Why would that require an incompatible change?
>>>>>>
>>>>>> We *could* make an incompatible change and remove support for
>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>> 3.0 compatible, which is the main goal.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin 
>>>>>> wrote:
>>>>>>
>>>>>> How would you not make incompatible changes in 3.x? As discussed the
>>>>>> InternalRow API is not stable and needs to change.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  wrote:
>>>>>>
>>>>>> > Making downstream to diverge their implementation heavily between
>>>>>> minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>
>>>>>> You're right that the API has been evolving in the 2.x line. But, it
>>>>>> is now reasonably stable with respect to the current

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
> If you insist we shouldn't change the unstable temporary API in 3.x . . .

Not what I'm saying at all. I said we should carefully consider whether a
breaking change is the right decision in the 3.x line.

All I'm suggesting is that we can make a 2.5 release with the feature and
an API that is the same as the one in 3.0.

> I also don't get this backporting a giant feature to 2.x line

I am planning to do this so we can use DSv2 before 3.0 is released. Then we
can have a source implementation that works in both 2.x and 3.0 to make the
transition easier. Since I'm already doing the work, I'm offering to share
it with the community.


On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin  wrote:

> Because for example we'd need to move the location of InternalRow,
> breaking the package name. If you insist we shouldn't change the unstable
> temporary API in 3.x to maintain compatibility with 3.0, which is totally
> different from my understanding of the situation when you exposed it, then
> I'd say we should gate 3.0 on having a stable row interface.
>
> I also don't get this backporting a giant feature to 2.x line ... as
> suggested by others in the thread, DSv2 would be one of the main reasons
> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
> Why not abandoning 3.0 entirely and backport all the features to 2.x?
>
>
>
> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue  wrote:
>
>> Why would that require an incompatible change?
>>
>> We *could* make an incompatible change and remove support for
>> InternalRow, but I think we would want to carefully consider whether that
>> is the right decision. And in any case, we would be able to keep 2.5 and
>> 3.0 compatible, which is the main goal.
>>
>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin  wrote:
>>
>> How would you not make incompatible changes in 3.x? As discussed the
>> InternalRow API is not stable and needs to change.
>>
>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  wrote:
>>
>> > Making downstream to diverge their implementation heavily between minor
>> versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>
>> You're right that the API has been evolving in the 2.x line. But, it is
>> now reasonably stable with respect to the current feature set and we should
>> not need to break compatibility in the 3.x line. Because we have reached
>> our goals for the 3.0 release, we can backport at least those features to
>> 2.x and confidently have an API that works in both a 2.x release and is
>> compatible with 3.0, if not 3.1 and later releases as well.
>>
>> > I'd rather say preparation of Spark 2.5 should be started after Spark
>> 3.0 is officially released
>>
>> The reason I'm suggesting this is that I'm already going to do the work
>> to backport the 3.0 release features to 2.4. I've been asked by several
>> people when DSv2 will be released, so I know there is a lot of interest in
>> making this available sooner than 3.0. If I'm already doing the work, then
>> I'd be happy to share that with the community.
>>
>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
>> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>> about complete so we can easily release the same set of features and API in
>> 2.5 and 3.0.
>>
>> If we decide for some reason to wait until after 3.0 is released, I don't
>> know that there is much value in a 2.5. The purpose is to be a step toward
>> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
>> wouldn't get these features out any sooner than 3.0, as a 2.5 release
>> probably would, given the work needed to validate the incompatible changes
>> in 3.0.
>>
>> > DSv2 change would be the major backward incompatibility which Spark 2.x
>> users may hesitate to upgrade
>>
>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>> expected. I don't think it will need incompatible changes in the 3.x line.
>>
>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:
>>
>> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
>> deal with this as the change made confusion on my PRs...), but my bet is
>> that DSv2 would be already changed in incompatible way, at least who works
>> for custom DataSource. Making downstream to diverge their implementation
>> heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good
>> experience - especially we are not completely closed the chance to further
>> modify DSv2, and the change could be backward incompatible.
>>
>> If we really want to bring the DSv2 ch

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
Why would that require an incompatible change?

We *could* make an incompatible change and remove support for InternalRow,
but I think we would want to carefully consider whether that is the right
decision. And in any case, we would be able to keep 2.5 and 3.0 compatible,
which is the main goal.

On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin  wrote:

> How would you not make incompatible changes in 3.x? As discussed the
> InternalRow API is not stable and needs to change.
>
> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue  wrote:
>
>> > Making downstream to diverge their implementation heavily between minor
>> versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>
>> You're right that the API has been evolving in the 2.x line. But, it is
>> now reasonably stable with respect to the current feature set and we should
>> not need to break compatibility in the 3.x line. Because we have reached
>> our goals for the 3.0 release, we can backport at least those features to
>> 2.x and confidently have an API that works in both a 2.x release and is
>> compatible with 3.0, if not 3.1 and later releases as well.
>>
>> > I'd rather say preparation of Spark 2.5 should be started after Spark
>> 3.0 is officially released
>>
>> The reason I'm suggesting this is that I'm already going to do the work
>> to backport the 3.0 release features to 2.4. I've been asked by several
>> people when DSv2 will be released, so I know there is a lot of interest in
>> making this available sooner than 3.0. If I'm already doing the work, then
>> I'd be happy to share that with the community.
>>
>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
>> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
>> about complete so we can easily release the same set of features and API in
>> 2.5 and 3.0.
>>
>> If we decide for some reason to wait until after 3.0 is released, I don't
>> know that there is much value in a 2.5. The purpose is to be a step toward
>> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
>> wouldn't get these features out any sooner than 3.0, as a 2.5 release
>> probably would, given the work needed to validate the incompatible changes
>> in 3.0.
>>
>> > DSv2 change would be the major backward incompatibility which Spark 2.x
>> users may hesitate to upgrade
>>
>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>> expected. I don't think it will need incompatible changes in the 3.x line.
>>
>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:
>>
>>> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
>>> deal with this as the change made confusion on my PRs...), but my bet is
>>> that DSv2 would be already changed in incompatible way, at least who works
>>> for custom DataSource. Making downstream to diverge their implementation
>>> heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good
>>> experience - especially we are not completely closed the chance to further
>>> modify DSv2, and the change could be backward incompatible.
>>>
>>> If we really want to bring the DSv2 change to 2.x version line to let
>>> end users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather
>>> say preparation of Spark 2.5 should be started after Spark 3.0 is
>>> officially released, honestly even later than that, say, getting some
>>> reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we
>>> don't make Spark 2.5 be a kind of "tech-preview" which Spark 2.4 users may
>>> be frustrated to upgrade to next minor version.
>>>
>>> Btw, do we have any specific target users for this? Personally DSv2
>>> change would be the major backward incompatibility which Spark 2.x users
>>> may hesitate to upgrade, so they might be already prepared to migrate to
>>> Spark 3.0 if they are prepared to migrate to new DSv2.
>>>
>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>>> I believe we follow Semantic Versioning (
>>>> https://spark.apache.org/versioning-policy.html ).
>>>>
>>>> > We just won’t add any breaking changes before 3.1.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
>>>> wrote:
>>>>
>>>>> I don’t think we need to gate a 3.0 

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
> Making downstream to diverge their implementation heavily between minor
versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But, it is now
reasonably stable with respect to the current feature set and we should not
need to break compatibility in the 3.x line. Because we have reached our
goals for the 3.0 release, we can backport at least those features to 2.x
and confidently have an API that works in both a 2.x release and is
compatible with 3.0, if not 3.1 and later releases as well.

> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0
is officially released

The reason I'm suggesting this is that I'm already going to do the work to
backport the 3.0 release features to 2.4. I've been asked by several people
when DSv2 will be released, so I know there is a lot of interest in making
this available sooner than 3.0. If I'm already doing the work, then I'd be
happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
about complete so we can easily release the same set of features and API in
2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't
know that there is much value in a 2.5. The purpose is to be a step toward
3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
wouldn't get these features out any sooner than 3.0, as a 2.5 release
probably would, given the work needed to validate the incompatible changes
in 3.0.

> DSv2 change would be the major backward incompatibility which Spark 2.x
users may hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is
expected. I don't think it will need incompatible changes in the 3.x line.

On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:

> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
> deal with this as the change made confusion on my PRs...), but my bet is
> that DSv2 would be already changed in incompatible way, at least who works
> for custom DataSource. Making downstream to diverge their implementation
> heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good
> experience - especially we are not completely closed the chance to further
> modify DSv2, and the change could be backward incompatible.
>
> If we really want to bring the DSv2 change to 2.x version line to let end
> users avoid forcing to upgrade Spark 3.x to enjoy new DSv2, I'd rather say
> preparation of Spark 2.5 should be started after Spark 3.0 is officially
> released, honestly even later than that, say, getting some reports from
> Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark
> 2.5 be a kind of "tech-preview" which Spark 2.4 users may be frustrated to
> upgrade to next minor version.
>
> Btw, do we have any specific target users for this? Personally DSv2 change
> would be the major backward incompatibility which Spark 2.x users may
> hesitate to upgrade, so they might be already prepared to migrate to Spark
> 3.0 if they are prepared to migrate to new DSv2.
>
> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun 
> wrote:
>
>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>> I believe we follow Semantic Versioning (
>> https://spark.apache.org/versioning-policy.html ).
>>
>> > We just won’t add any breaking changes before 3.1.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
>> wrote:
>>
>>> I don’t think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow
>>>
>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>> problems with it.
>>>
>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>> seems like a false premise.
>>>
>>> Why do you think we will need to break certain APIs before 3.0?
>>>
>>> I’m only suggesting that we release the same support in a 2.5 release
>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>> seems like we can certainly do that. We just won’t add any breaking changes
>>> before 3.1.
>>>
>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin 
>>> wrote:
>>>
>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>> (which will change and progress towards more stable, but will have to break
>>>>

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
Thanks for pointing this out, Dongjoon.

To clarify, I’m not suggesting that we can break compatibility. I’m
suggesting that we make a 2.5 release that uses the same DSv2 API as 3.0.

These APIs are marked unstable, so we could make changes to them if we
needed — as we have done in the 2.x line — but I don’t see a reason why we
would break compatibility in the 3.x line.

On Fri, Sep 20, 2019 at 8:46 PM Dongjoon Hyun 
wrote:

> Do you mean you want to have a breaking API change between 3.0 and 3.1?
> I believe we follow Semantic Versioning (
> https://spark.apache.org/versioning-policy.html ).
>
> > We just won’t add any breaking changes before 3.1.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
> wrote:
>
>> I don’t think we need to gate a 3.0 release on making a more stable
>> version of InternalRow
>>
>> Sounds like we agree, then. We will use it for 3.0, but there are known
>> problems with it.
>>
>> Thinking we’d have dsv2 working in both 3.x (which will change and
>> progress towards more stable, but will have to break certain APIs) and 2.x
>> seems like a false premise.
>>
>> Why do you think we will need to break certain APIs before 3.0?
>>
>> I’m only suggesting that we release the same support in a 2.5 release
>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>> seems like we can certainly do that. We just won’t add any breaking changes
>> before 3.1.
>>
>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin  wrote:
>>
>>> I don't think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>> (which will change and progress towards more stable, but will have to break
>>> certain APIs) and 2.x seems like a false premise.
>>>
>>> To point out some problems with InternalRow that you think are already
>>> pragmatic and stable:
>>>
>>> The class is in catalyst, which states:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>
>>> /**
>>> * Catalyst is a library for manipulating relational query plans.  All
>>> classes in catalyst are
>>> * considered an internal API to Spark SQL and are subject to change
>>> between minor releases.
>>> */
>>>
>>> There is no even any annotation on the interface.
>>>
>>> The entire dependency chain were created to be private, and tightly
>>> coupled with internal implementations. For example,
>>>
>>>
>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>
>>> /**
>>> * A UTF-8 String for internal Spark use.
>>> * 
>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>> comparison,
>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>> * 
>>> * Note: This is not designed for general use cases, should not be used
>>> outside SQL.
>>> */
>>>
>>>
>>>
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>
>>> (which again is in catalyst package)
>>>
>>>
>>> If you want to argue this way, you might as well argue we should make
>>> the entire catalyst package public to be pragmatic and not allow any
>>> changes.
>>>
>>>
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue  wrote:
>>>
>>>> When you created the PR to make InternalRow public
>>>>
>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>>>> Exposing this API has always been a part of DSv2 and while both you and I
>>>> did some work to avoid this, we are still in the phase of starting with
>>>> that API.
>>>>
>>>> Note that any change to InternalRow would be very costly to implement
>>>> because this interface is widely used. That is why I think we can certainly
>>>> consider it stable enough to use here, and that’s probably why
>>>> UnsafeRow was part of the original proposal.
>>>>
>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow,
>>>> it was to get the majority of SQL working on top of the interface added
>>>>

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
I don’t think we need to gate a 3.0 release on making a more stable version
of InternalRow

Sounds like we agree, then. We will use it for 3.0, but there are known
problems with it.

Thinking we’d have dsv2 working in both 3.x (which will change and progress
towards more stable, but will have to break certain APIs) and 2.x seems
like a false premise.

Why do you think we will need to break certain APIs before 3.0?

I’m only suggesting that we release the same support in a 2.5 release that
we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems
like we can certainly do that. We just won’t add any breaking changes
before 3.1.

On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin  wrote:

> I don't think we need to gate a 3.0 release on making a more stable
> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
> (which will change and progress towards more stable, but will have to break
> certain APIs) and 2.x seems like a false premise.
>
> To point out some problems with InternalRow that you think are already
> pragmatic and stable:
>
> The class is in catalyst, which states:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>
> /**
> * Catalyst is a library for manipulating relational query plans.  All
> classes in catalyst are
> * considered an internal API to Spark SQL and are subject to change
> between minor releases.
> */
>
> There is not even any annotation on the interface.
>
> The entire dependency chain was created to be private, and is tightly
> coupled with internal implementations. For example,
>
>
> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>
> /**
> * A UTF-8 String for internal Spark use.
> * <p>
> * A String encoded in UTF-8 as an Array[Byte], which can be used for
> comparison,
> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
> * <p>
> * Note: This is not designed for general use cases, should not be used
> outside SQL.
> */
>
>
>
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>
> (which again is in catalyst package)
>
>
> If you want to argue this way, you might as well argue we should make the
> entire catalyst package public to be pragmatic and not allow any changes.
>
>
>
>
> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue  wrote:
>
>> When you created the PR to make InternalRow public
>>
>> This isn’t quite accurate. The change I made was to use InternalRow
>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>> Exposing this API has always been a part of DSv2 and while both you and I
>> did some work to avoid this, we are still in the phase of starting with
>> that API.
>>
>> Note that any change to InternalRow would be very costly to implement
>> because this interface is widely used. That is why I think we can certainly
>> consider it stable enough to use here, and that’s probably why UnsafeRow
>> was part of the original proposal.
>>
>> In any case, the goal for 3.0 was not to replace the use of InternalRow,
>> it was to get the majority of SQL working on top of the interface added
>> after 2.4. That’s done and stable, so I think a 2.5 release with it is also
>> reasonable.
>>
>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin  wrote:
>>
>> To push back, while I agree we should not drastically change
>> "InternalRow", there are a lot of changes that need to happen to make it
>> stable. For example, none of the publicly exposed interfaces should be in
>> the Catalyst package or the unsafe package. External implementations should
>> be decoupled from the internal implementations, with cheap ways to convert
>> back and forth.
>>
>> When you created the PR to make InternalRow public, the understanding was
>> to work towards making it stable in the future, assuming we will start with
>> an unstable API temporarily. You can't just make a bunch of internal APIs
>> tightly coupled with other internal pieces public and stable and call it a
>> day, just because it happens to satisfy some use cases temporarily, assuming
>> the rest of Spark doesn't change.
>>
>>
>>
>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue  wrote:
>>
>> > DSv2 is far from stable right?
>>
>> No, I think it is reasonably stable and very close to being ready for a
>> release.
>>
>> > All the actual data types are unstable and you guys have completely
>> ignored that.
>>
>> I think what you're referring to i

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead
of UnsafeRow, which is a specific implementation of InternalRow. Exposing
this API has always been a part of DSv2 and while both you and I did some
work to avoid this, we are still in the phase of starting with that API.
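
To make the distinction concrete, a minimal sketch of a reader that hands
Spark InternalRow values directly, of which UnsafeRow is only one possible
implementation. The class name, the (id, name) data, and the PartitionReader
package path are assumptions based on the 3.0-era connector API rather than
anything in this thread:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.read.PartitionReader
    import org.apache.spark.unsafe.types.UTF8String

    // Hypothetical reader over in-memory (id, name) pairs, for illustration only.
    class ExamplePartitionReader(rows: Seq[(Long, String)])
        extends PartitionReader[InternalRow] {

      private val iter = rows.iterator
      private var current: InternalRow = _

      override def next(): Boolean = {
        if (iter.hasNext) {
          val (id, name) = iter.next()
          // Any InternalRow implementation is acceptable here; the source is
          // not required to build an UnsafeRow.
          current = InternalRow(id, UTF8String.fromString(name))
          true
        } else {
          false
        }
      }

      override def get(): InternalRow = current

      override def close(): Unit = ()
    }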

Note that any change to InternalRow would be very costly to implement
because this interface is widely used. That is why I think we can certainly
consider it stable enough to use here, and that’s probably why UnsafeRow
was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow, it
was to get the majority of SQL working on top of the interface added after
2.4. That’s done and stable, so I think a 2.5 release with it is also
reasonable.
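
As background for the internal coupling discussed in this thread (InternalRow
depends on catalyst and unsafe types such as UTF8String), a small sketch of
what an implementer touches when producing and consuming an InternalRow; the
values are made up for illustration:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.unsafe.types.UTF8String

    // String columns cannot be plain java.lang.String values; they must be
    // UTF8String, which Spark documents as not designed for use outside SQL.
    val row: InternalRow = InternalRow(UTF8String.fromString("spark"), 42)

    // Reading the values back also goes through the internal accessors.
    val name: String = row.getUTF8String(0).toString
    val count: Int = row.getInt(1)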

On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin  wrote:

> To push back, while I agree we should not drastically change
> "InternalRow", there are a lot of changes that need to happen to make it
> stable. For example, none of the publicly exposed interfaces should be in
> the Catalyst package or the unsafe package. External implementations should
> be decoupled from the internal implementations, with cheap ways to convert
> back and forth.
>
> When you created the PR to make InternalRow public, the understanding was
> to work towards making it stable in the future, assuming we will start with
> an unstable API temporarily. You can't just make a bunch of internal APIs
> tightly coupled with other internal pieces public and stable and call it a
> day, just because it happens to satisfy some use cases temporarily, assuming
> the rest of Spark doesn't change.
>
>
>
> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue  wrote:
>
>> > DSv2 is far from stable right?
>>
>> No, I think it is reasonably stable and very close to being ready for a
>> release.
>>
>> > All the actual data types are unstable and you guys have completely
>> ignored that.
>>
>> I think what you're referring to is the use of `InternalRow`. That's a
>> stable API and there has been no work to avoid using it. In any case, I
>> don't think that anyone is suggesting that we delay 3.0 until a replacement
>> for `InternalRow` is added, right?
>>
>> While I understand the motivation for a better solution here, I think the
>> pragmatic solution is to continue using `InternalRow`.
>>
>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>> invasive of a change to backport once you consider the parts needed to make
>> dsv2 stable.
>>
>> I believe that those of us working on DSv2 are confident about the
>> current stability. We set goals for what to get into the 3.0 release months
>> ago and have very nearly reached the point where we are ready for that
>> release.
>>
>> I don't think instability would be a problem in maintaining compatibility
>> between the 2.5 version and the 3.0 version. If we find that we need to
>> make API changes (other than additions), then we can make those in the 3.1
>> release. Because the goals we set for the 3.0 release have been reached
>> with the current API, if we are ready to release 3.0 we can release a
>> 2.5 with the same API.
>>
>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin  wrote:
>>
>> DSv2 is far from stable right? All the actual data types are unstable and
>> you guys have completely ignored that. We'd need to work on that and that
>> will be a breaking change. If the goal is to make DSv2 work across 3.x and
>> 2.x, that seems too invasive of a change to backport once you consider the
>> parts needed to make dsv2 stable.
>>
>>
>>
>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue 
>> wrote:
>>
>> Hi everyone,
>>
>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
>> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>
>> A Spark 2.5 release with these two additions will help people migrate to
>> Spark 3.0 when it is released because they will be able to use a single
>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>> upgrading to 3.0 won't also require updating to Java 11, because users
>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>
>> Another reason to consider a 2.5 release is that many people are
>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>> it makes sense to share this work with the community.
>>
>> This release
