unsubscribe

2023-06-19 Thread Bharat Kul Ratan



Re: [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-19 Thread Dongjoon Hyun
Hi, Herman.

This is part of a series of discussions, as I re-summarized here.

You can find some context in the previous timeline thread.

2023-05-30 Apache Spark 4.0 Timeframe?
https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6

Could you reply there to collect your timeline suggestions? We can discuss
more there.

Dongjoon.



On Mon, Jun 19, 2023 at 1:58 PM Herman van Hovell 
wrote:

> Dongjoon, I am not sure I follow the line of thought here.
>
> Multiple people have asked for clarification on what Spark 4.0 would mean
> (Holden, Mridul, Jia & Xiao). You can - for the record - also add me to
> this list. However, you choose to single out Xiao because he asks this
> question and wants to do a preview release as well? So again, what does Spark 4
> mean, and why does it need to take almost a year? Historically major Spark
> releases tend to break APIs, but if it only entails changing to Scala 2.13
> and dropping support for JDK 8, then we could also just release a month
> after 3.5.
>
> How about we do this? We get 3.5 released, and afterwards we do a couple
> of meetings where we build this roadmap. Using that, we can - hopefully -
> have a grounded discussion.
>
> Cheers,
> Herman
>
> On Mon, Jun 19, 2023 at 4:01 PM Dongjoon Hyun  wrote:
>
>> Thank you. I reviewed the threads, vote and result once more.
>>
>> I found that I missed the binding vote mark on Holden in the vote result
>> email. The following should be "-0: Holden Karau *". Sorry for this
>> mistake, Holden and all.
>>
>> > -0: Holden Karau
>>
>> To Hyukjin, I disagree with you on the following point because the thread
>> started clearly with your and Sean's Apache Spark 4.0 requirement in order
>> to move away from Scala 2.12. In addition, we also discussed another item
>> (dropping Java 8) from another current dev thread. The vote scope and goal
>> are clear and specific.
>>
>> > we're unclear on the picture of Spark 4.0.0.
>>
>> Rather than the vote scope and result, what is really unclear is what you
>> propose here. If Xiao wants a preview, he can propose a preview plan. That's
>> welcome. If you have many 4.0 dev ideas which are not yet exposed to the
>> community, please share them with the community. That's welcome, too. Apache
>> Spark is an open source community. If you don't share it, there is no way
>> for us to know what you want.
>>
>> Dongjoon
>>
>> On 2023/06/19 04:31:46 Hyukjin Kwon wrote:
>> > The major concerns raised in the thread were that we should initiate the
>> > discussion for the below first:
>> > - Apache Spark 4.0.0 Preview (and Dates)
>> > - Apache Spark 4.0.0 Items
>> > - Apache Spark 4.0.0 Plan Adjustment
>> >
>> > before setting the timeline for Spark 4.0.0 because we're unclear on the
>> > picture of Spark 4.0.0. So discussing the 4.0.0 timeline first is
>> > procedurally the opposite order.
>> > The vote passed as a procedural matter, but I would prefer to consider this
>> > a tentative date; we will probably need another vote to adjust the
>> > date considering the plans, preview dates, and items we aim for in 4.0.0.
>> >
>> >
>> > On Sat, 17 Jun 2023 at 04:33, Dongjoon Hyun 
>> wrote:
>> >
>> > > This was part of the following ongoing discussions.
>> > >
>> > > 2023-05-28  Apache Spark 3.5.0 Expectations (?)
>> > > https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
>> > >
>> > > 2023-05-30 Apache Spark 4.0 Timeframe?
>> > > https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
>> > >
>> > > 2023-06-05 ASF policy violation and Scala version issues
>> > > https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
>> > >
>> > > 2023-06-12 [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)
>> > > https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb
>> > >
>> > > I'm looking forward to seeing the upcoming detailed discussions
>> including
>> > > the following
>> > > - Apache Spark 4.0.0 Preview (and Dates)
>> > > - Apache Spark 4.0.0 Items
>> > > - Apache Spark 4.0.0 Plan Adjustment
>> > >
>> > > Please initiate the discussion.
>> > >
>> > > Thanks,
>> > > Dongjoon.
>> > >
>> > >
>> > > On 2023/06/16 19:30:42 Dongjoon Hyun wrote:
>> > > > The vote passes with 6 +1s (4 binding +1s), one -0, and one -1.
>> > > > Thank you all for your participation and
>> > > > especially your additional comments during this voting,
>> > > > Mridul, Hyukjin, and Jungtaek.
>> > > >
>> > > > (* = binding)
>> > > > +1:
>> > > > - Dongjoon Hyun *
>> > > > - Huaxin Gao *
>> > > > - Liang-Chi Hsieh *
>> > > > - Kazuyuki Tanimura
>> > > > - Chao Sun *
>> > > > - Jia Fan
>> > > >
>> > > > -0: Holden Karau
>> > > >
>> > > > -1: Xiao Li *
>> > > >
>> > >
>> > >
>> > >
>> >
>>
>>
>>


Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Cheng Pan
This API looks like it starts from scratch and has no relationship with the
existing Java/Scala DataSourceV2 API. In particular, how can such sources support SQL?

We have gone back and forth on the DataSource V2 design since 2.3; I believe
there are lessons from that to apply when introducing the Python DataSource API.

Thanks,
Cheng Pan




> On Jun 16, 2023, at 12:14, Allison Wang  
> wrote:
> 
> Hi everyone,
> 
> I would like to start a discussion on “Python Data Source API”.
> 
> This proposal aims to introduce a simple API in Python for Data Sources. The 
> idea is to enable Python developers to create data sources without having to 
> learn Scala or deal with the complexities of the current data source APIs. 
> The goal is to make a Python-based API that is simple and easy to use, thus 
> making Spark more accessible to the wider Python developer community. This 
> proposed approach is based on the recently introduced Python user-defined 
> table functions with extensions to support data sources.
> 
> SPIP Doc:  
> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
> 
> SPIP JIRA: https://issues.apache.org/jira/browse/SPARK-44076
> 
> Looking forward to your feedback.
> 
> Thanks,
> Allison
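
To ground the discussion, here is a rough sketch of the shape such a Python
data source could take. Everything below is illustrative guesswork: the class
and method names (CounterDataSource, schema, partitions, read) are not taken
from the SPIP doc, and the actual proposal builds on the Python user-defined
table function machinery rather than plain classes like this.

# Hypothetical sketch only -- names and structure are illustrative,
# not the SPIP's actual interface.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Partition:
    start: int
    end: int

class CounterDataSource:
    """A toy data source that emits consecutive integers."""

    def __init__(self, options: dict):
        self.n = int(options.get("rows", "10"))

    def schema(self) -> str:
        # The schema could be expressed as a DDL string for Spark to parse.
        return "id INT"

    def partitions(self) -> List[Partition]:
        # Split the range in two so an engine could read in parallel.
        mid = self.n // 2
        return [Partition(0, mid), Partition(mid, self.n)]

    def read(self, partition: Partition) -> Iterator[tuple]:
        # Each partition yields plain Python tuples matching the schema.
        for i in range(partition.start, partition.end):
            yield (i,)

if __name__ == "__main__":
    source = CounterDataSource({"rows": "4"})
    for part in source.partitions():
        for row in source.read(part):
            print(row)

If a source like this could register a short name, Cheng Pan's SQL question
might reduce to whether that name is resolvable from DDL such as
CREATE TABLE ... USING <name>, as it is for Java/Scala DSv2 sources today.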





Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Hyukjin Kwon
Actually, I support this idea in the sense that Python developers wouldn't have
to learn Scala (or handle separate packaging) to write their own source.
This is especially crucial when you want to write a simple data source
that interacts with the Python ecosystem.

On Tue, 20 Jun 2023 at 03:08, Denny Lee  wrote:

> Slightly biased, but per my conversations - this would be awesome to have!
>
> On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari 
> wrote:
>
>> I would definitely use it - if it's available :)
>>
>> On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,  wrote:
>>
>>> Hi Allison and devs,
>>>
>>> Although I was against this idea at first sight (probably because I'm a
>>> Scala dev), I think it could work as long as there are people who'd be
>>> interested in such an API. Were there any? I'm just curious. I've seen no
>>> emails requesting it.
>>>
>>> I also doubt that Python devs would like to work on new data sources, but I
>>> support their wishes wholeheartedly :)
>>>
>>> Regards,
>>> Jacek Laskowski
>>> 
>>> "The Internals Of" Online Books 
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>> 
>>>
>>>
>>> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
>>>  wrote:
>>>
 Hi everyone,

 I would like to start a discussion on “Python Data Source API”.

 This proposal aims to introduce a simple API in Python for Data
 Sources. The idea is to enable Python developers to create data sources
 without having to learn Scala or deal with the complexities of the current
 data source APIs. The goal is to make a Python-based API that is simple and
 easy to use, thus making Spark more accessible to the wider Python
 developer community. This proposed approach is based on the recently
 introduced Python user-defined table functions with extensions to support
 data sources.

 *SPIP Doc*:
 https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing

 *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076

 Looking forward to your feedback.

 Thanks,
 Allison

>>>


Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-19 Thread Jia Fan
+1

Dongjoon Hyun  wrote on Tue, Jun 20, 2023 at 10:41:

> Please vote on releasing the following candidate as Apache Spark version
> 3.4.1.
>
> The vote is open until June 23rd 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.4.1-rc1 (commit
> 6b1ff22dde1ead51cbf370be6e48a802daae58b6)
> https://github.com/apache/spark/tree/v3.4.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1443/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
>
> The list of bug fixes going into 3.4.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12352874
>
> This release is using the release script of the tag v3.4.1-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.1?
> ===
>
> The current list of open tickets targeted at 3.4.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


[VOTE] Release Spark 3.4.1 (RC1)

2023-06-19 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
3.4.1.

The vote is open until June 23rd 1AM (PST) and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.4.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.4.1-rc1 (commit
6b1ff22dde1ead51cbf370be6e48a802daae58b6)
https://github.com/apache/spark/tree/v3.4.1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1443/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/

The list of bug fixes going into 3.4.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12352874

This release is using the release script of the tag v3.4.1-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
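
As a concrete illustration of the PySpark route above, a minimal smoke test
might look like the following. It assumes the RC's pyspark has already been
installed into the virtual env; the exact install step depends on where you
fetched the RC artifacts.

# Minimal RC smoke test; assumes the 3.4.1 RC build of pyspark is
# installed in the current environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[2]").appName("rc-smoke").getOrCreate()
)
print("Spark version:", spark.version)  # expect 3.4.1

# A trivial workload to exercise planning and execution.
df = spark.range(1000).selectExpr("id", "id % 7 AS bucket")
rows = df.groupBy("bucket").count().orderBy("bucket").collect()
assert sum(r["count"] for r in rows) == 1000

spark.stop()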

===
What should happen to JIRA tickets still targeting 3.4.1?
===

The current list of open tickets targeted at 3.4.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.4.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-19 Thread Herman van Hovell
Dongjoon, I am not sure I follow the line of thought here.

Multiple people have asked for clarification on what Spark 4.0 would mean
(Holden, Mridul, Jia & Xiao). You can - for the record - also add me to
this list. However, you choose to single out Xiao because he asks this
question and wants to do a preview release as well? So again, what does Spark 4
mean, and why does it need to take almost a year? Historically major Spark
releases tend to break APIs, but if it only entails changing to Scala 2.13
and dropping support for JDK 8, then we could also just release a month
after 3.5.

How about we do this? We get 3.5 released, and afterwards we do a couple of
meetings where we build this roadmap. Using that, we can - hopefully - have
a grounded discussion.

Cheers,
Herman

On Mon, Jun 19, 2023 at 4:01 PM Dongjoon Hyun  wrote:

> Thank you. I reviewed the threads, vote and result once more.
>
> I found that I missed the binding vote mark on Holden in the vote result
> email. The following should be "-0: Holden Karau *". Sorry for this
> mistake, Holden and all.
>
> > -0: Holden Karau
>
> To Hyukjin, I disagree with you on the following point because the thread
> started clearly with your and Sean's Apache Spark 4.0 requirement in order
> to move away from Scala 2.12. In addition, we also discussed another item
> (dropping Java 8) from another current dev thread. The vote scope and goal
> are clear and specific.
>
> > we're unclear on the picture of Spark 4.0.0.
>
> Rather than the vote scope and result, what is really unclear is what you
> propose here. If Xiao wants a preview, he can propose a preview plan. That's
> welcome. If you have many 4.0 dev ideas which are not yet exposed to the
> community, please share them with the community. That's welcome, too. Apache
> Spark is an open source community. If you don't share it, there is no way
> for us to know what you want.
>
> Dongjoon
>
> On 2023/06/19 04:31:46 Hyukjin Kwon wrote:
> > The major concerns raised in the thread were that we should initiate the
> > discussion for the below first:
> > - Apache Spark 4.0.0 Preview (and Dates)
> > - Apache Spark 4.0.0 Items
> > - Apache Spark 4.0.0 Plan Adjustment
> >
> > before setting the timeline for Spark 4.0.0 because we're unclear on the
> > picture of Spark 4.0.0. So discussing the 4.0.0 timeline first is
> > procedurally the opposite order.
> > The vote passed as a procedural matter, but I would prefer to consider this
> > a tentative date; we will probably need another vote to adjust the
> > date considering the plans, preview dates, and items we aim for in 4.0.0.
> >
> >
> > On Sat, 17 Jun 2023 at 04:33, Dongjoon Hyun  wrote:
> >
> > > This was part of the following ongoing discussions.
> > >
> > > 2023-05-28  Apache Spark 3.5.0 Expectations (?)
> > > https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
> > >
> > > 2023-05-30 Apache Spark 4.0 Timeframe?
> > > https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
> > >
> > > 2023-06-05 ASF policy violation and Scala version issues
> > > https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
> > >
> > > 2023-06-12 [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)
> > > https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb
> > >
> > > I'm looking forward to seeing the upcoming detailed discussions
> including
> > > the following
> > > - Apache Spark 4.0.0 Preview (and Dates)
> > > - Apache Spark 4.0.0 Items
> > > - Apache Spark 4.0.0 Plan Adjustment
> > >
> > > Please initiate the discussion.
> > >
> > > Thanks,
> > > Dongjoon.
> > >
> > >
> > > On 2023/06/16 19:30:42 Dongjoon Hyun wrote:
> > > > The vote passes with 6 +1s (4 binding +1s), one -0, and one -1.
> > > > Thank you all for your participation and
> > > > especially your additional comments during this voting,
> > > > Mridul, Hyukjin, and Jungtaek.
> > > >
> > > > (* = binding)
> > > > +1:
> > > > - Dongjoon Hyun *
> > > > - Huaxin Gao *
> > > > - Liang-Chi Hsieh *
> > > > - Kazuyuki Tanimura
> > > > - Chao Sun *
> > > > - Jia Fan
> > > >
> > > > -0: Holden Karau
> > > >
> > > > -1: Xiao Li *
> > > >
> > >
> > >
> > >
> >
>
>
>


Re: [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-19 Thread Dongjoon Hyun
Thank you. I reviewed the threads, vote and result once more.

I found that I missed the binding vote mark on Holden in the vote result email. 
The following should be "-0: Holden Karau *". Sorry for this mistake, Holden 
and all.

> -0: Holden Karau

To Hyukjin, I disagree with you on the following point because the thread
started clearly with your and Sean's Apache Spark 4.0 requirement in order to 
move away from Scala 2.12. In addition, we also discussed another item 
(dropping Java 8) from another current dev thread. The vote scope and goal
are clear and specific.

> we're unclear on the picture of Spark 4.0.0. 

Rather than the vote scope and result, what is really unclear is what you
propose here. If Xiao wants a preview, he can propose a preview plan. That's
welcome. If you have many 4.0 dev ideas which are not yet exposed to the
community, please share them with the community. That's welcome, too. Apache
Spark is an open source community. If you don't share it, there is no way
for us to know what you want.

Dongjoon

On 2023/06/19 04:31:46 Hyukjin Kwon wrote:
> The major concerns raised in the thread were that we should initiate the
> discussion for the below first:
> - Apache Spark 4.0.0 Preview (and Dates)
> - Apache Spark 4.0.0 Items
> - Apache Spark 4.0.0 Plan Adjustment
> 
> before setting the timeline for Spark 4.0.0 because we're unclear on the
> picture of Spark 4.0.0. So discussing the 4.0.0 timeline first is
> procedurally the opposite order.
> The vote passed as a procedural matter, but I would prefer to consider this
> a tentative date; we will probably need another vote to adjust the
> date considering the plans, preview dates, and items we aim for in 4.0.0.
> 
> 
> On Sat, 17 Jun 2023 at 04:33, Dongjoon Hyun  wrote:
> 
> > This was part of the following ongoing discussions.
> >
> > 2023-05-28  Apache Spark 3.5.0 Expectations (?)
> > https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
> >
> > 2023-05-30 Apache Spark 4.0 Timeframe?
> > https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
> >
> > 2023-06-05 ASF policy violation and Scala version issues
> > https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
> >
> > 2023-06-12 [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)
> > https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb
> >
> > I'm looking forward to seeing the upcoming detailed discussions including
> > the following
> > - Apache Spark 4.0.0 Preview (and Dates)
> > - Apache Spark 4.0.0 Items
> > - Apache Spark 4.0.0 Plan Adjustment
> >
> > Please initiate the discussion.
> >
> > Thanks,
> > Dongjoon.
> >
> >
> > On 2023/06/16 19:30:42 Dongjoon Hyun wrote:
> > > The vote passes with 6 +1s (4 binding +1s), one -0, and one -1.
> > > Thank you all for your participation and
> > > especially your additional comments during this voting,
> > > Mridul, Hyukjin, and Jungtaek.
> > >
> > > (* = binding)
> > > +1:
> > > - Dongjoon Hyun *
> > > - Huaxin Gao *
> > > - Liang-Chi Hsieh *
> > > - Kazuyuki Tanimura
> > > - Chao Sun *
> > > - Jia Fan
> > >
> > > -0: Holden Karau
> > >
> > > -1: Xiao Li *
> > >
> >
> >
> >
> 




Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Denny Lee
Slightly biased, but per my conversations - this would be awesome to have!

On Mon, Jun 19, 2023 at 09:43 Abdeali Kothari 
wrote:

> I would definitely use it - if it's available :)
>
> On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,  wrote:
>
>> Hi Allison and devs,
>>
>> Although I was against this idea at first sight (probably because I'm a
>> Scala dev), I think it could work as long as there are people who'd be
>> interested in such an API. Were there any? I'm just curious. I've seen no
>> emails requesting it.
>>
>> I also doubt that Python devs would like to work on new data sources, but I
>> support their wishes wholeheartedly :)
>>
>> Regards,
>> Jacek Laskowski
>> 
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
>>  wrote:
>>
>>> Hi everyone,
>>>
>>> I would like to start a discussion on “Python Data Source API”.
>>>
>>> This proposal aims to introduce a simple API in Python for Data Sources.
>>> The idea is to enable Python developers to create data sources without
>>> having to learn Scala or deal with the complexities of the current data
>>> source APIs. The goal is to make a Python-based API that is simple and easy
>>> to use, thus making Spark more accessible to the wider Python developer
>>> community. This proposed approach is based on the recently introduced
>>> Python user-defined table functions with extensions to support data sources.
>>>
>>> *SPIP Doc*:
>>> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
>>>
>>> *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076
>>>
>>> Looking forward to your feedback.
>>>
>>> Thanks,
>>> Allison
>>>
>>


unsubscribe

2023-06-19 Thread Bharat Kul Ratan



Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Abdeali Kothari
I would definitely use it - if it's available :)

On Mon, 19 Jun 2023, 21:56 Jacek Laskowski,  wrote:

> Hi Allison and devs,
>
> Although I was against this idea at first sight (probably because I'm a
> Scala dev), I think it could work as long as there are people who'd be
> interested in such an API. Were there any? I'm just curious. I've seen no
> emails requesting it.
>
> I also doubt that Python devs would like to work on new data sources, but I
> support their wishes wholeheartedly :)
>
> Regards,
> Jacek Laskowski
> 
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
>  wrote:
>
>> Hi everyone,
>>
>> I would like to start a discussion on “Python Data Source API”.
>>
>> This proposal aims to introduce a simple API in Python for Data Sources.
>> The idea is to enable Python developers to create data sources without
>> having to learn Scala or deal with the complexities of the current data
>> source APIs. The goal is to make a Python-based API that is simple and easy
>> to use, thus making Spark more accessible to the wider Python developer
>> community. This proposed approach is based on the recently introduced
>> Python user-defined table functions with extensions to support data sources.
>>
>> *SPIP Doc*:
>> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
>>
>> *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076
>>
>> Looking forward to your feedback.
>>
>> Thanks,
>> Allison
>>
>


Re: [DISCUSS] SPIP: Python Data Source API

2023-06-19 Thread Jacek Laskowski
Hi Allison and devs,

Although I was against this idea at first sight (probably because I'm a
Scala dev), I think it could work as long as there are people who'd be
interested in such an API. Were there any? I'm just curious. I've seen no
emails requesting it.

I also doubt that Python devs would like to work on new data sources, but I
support their wishes wholeheartedly :)

Regards,
Jacek Laskowski

"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Fri, Jun 16, 2023 at 6:14 AM Allison Wang
 wrote:

> Hi everyone,
>
> I would like to start a discussion on “Python Data Source API”.
>
> This proposal aims to introduce a simple API in Python for Data Sources.
> The idea is to enable Python developers to create data sources without
> having to learn Scala or deal with the complexities of the current data
> source APIs. The goal is to make a Python-based API that is simple and easy
> to use, thus making Spark more accessible to the wider Python developer
> community. This proposed approach is based on the recently introduced
> Python user-defined table functions with extensions to support data sources.
>
> *SPIP Doc*:
> https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
>
> *SPIP JIRA*: https://issues.apache.org/jira/browse/SPARK-44076
>
> Looking forward to your feedback.
>
> Thanks,
> Allison
>


Re: Data Contracts

2023-06-19 Thread Deepak Sharma
Sorry for using "simple" in my last email.
It's not going to be simple in any terms.
Thanks for sharing the Git repo, Phillip.
Will definitely go through it.

Thanks
Deepak

On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry 
wrote:

> I think it might be a bit more complicated than this (but happy to be
> proved wrong).
>
> I have a minimum working example at:
>
> https://github.com/PhillHenry/SparkConstraints.git
>
> that runs out-of-the-box (mvn test) and demonstrates what I am trying to
> achieve.
>
> A test persists a DataFrame that conforms to the contract and demonstrates
> that one that does not conform throws an exception.
>
> I've had to slightly modify 3 Spark files to add the data contract
> functionality. If you can think of a more elegant solution, I'd be very
> grateful.
>
> Regards,
>
> Phillip
>
>
>
>
> On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma 
> wrote:
>
>> It can be as simple as adding a function to the Spark session builder,
>> specifically on the read path, which can take the YAML file (assuming data
>> contracts are defined in YAML) and apply it to the DataFrame.
>> It can ignore the rows not matching the data contracts defined in the
>> YAML.
>>
>> Thanks
>> Deepak
>>
>> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry 
>> wrote:
>>
>>> For my part, I'm not too concerned about the mechanism used to implement
>>> the validation as long as it's rich enough to express the constraints.
>>>
>>> I took a look at JSON Schema (for which there are a number of JVM
>>> implementations) but I don't think it can handle more complex data types
>>> like dates. Maybe Elliot can comment on this?
>>>
>>> Ideally, *any* reasonable mechanism could be plugged in.
>>>
>>> But what struck me from trying to write a Proof of Concept was that it
>>> was quite hard to inject my code into this particular area of the Spark
>>> machinery. It could very well be due to my limited understanding of the
>>> codebase, but it seemed the Spark code would need a bit of a refactor
>>> before a component could be injected. Maybe people in this forum with
>>> greater knowledge in this area could comment?
>>>
>>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears
>>> to be attempting to implement data contracts within their ecosystem.
>>> Unfortunately, I think it's closed source and Python only.
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 It would be interesting to think about creating a contract
 validation library written in JSON format. This would ensure a validation
 mechanism that will rely on this library and could be shared among relevant
 parties. Will that be a starting point?

 HTH

 Mich Talebzadeh,
 Lead Solutions Architect/Engineering Lead
 Palantir Technologies Limited
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin  wrote:

> Hi,
>
> While I was at PayPal, we open-sourced a data contract template; it
> is here: https://github.com/paypal/data-contract-template. Companies
> like GX (Great Expectations) are interested in using it.
>
> Spark could read some elements from it pretty easily, like schema
> validation and some rule validations. Spark could also generate an embryo of
> data contracts…
>
> —jgp
>
>
> On Jun 13, 2023, at 07:25, Mich Talebzadeh 
> wrote:
>
> From my limited understanding of data contracts, there are two factors
> that seem necessary.
>
>
>1. procedure matter
>2. technical matter
>
> I mean this is nothing new. Some tools like Cloud data fusion can
> assist when the procedures are validated. Simply "The process of
> integrating multiple data sources to produce more consistent, accurate, 
> and
> useful information than that provided by any individual data source.". In
> the old days, we had staging tables that were used to clean and prune data
> from multiple sources. Nowadays we use the so-called Integration layer. If
> you use Spark as an ETL tool, then you have to build this validation
> yourself. Case in point, how to map customer_id from one source to
> customer_no from another. Legacy systems are full of these anomalies. MDM
> can help but requires human intervention which is time consuming. I am not
> sure the 

Re: Data Contracts

2023-06-19 Thread Phillip Henry
I think it might be a bit more complicated than this (but happy to be
proved wrong).

I have a minimum working example at:

https://github.com/PhillHenry/SparkConstraints.git

that runs out-of-the-box (mvn test) and demonstrates what I am trying to
achieve.

A test persists a DataFrame that conforms to the contract and demonstrates
that one that does not conform throws an exception.

I've had to slightly modify 3 Spark files to add the data contract
functionality. If you can think of a more elegant solution, I'd be very
grateful.

Regards,

Phillip
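
For a feel of the idea without building the repo, a much-simplified stand-in
is sketched below. The example in the repo patches three Spark files so the
contract is enforced inside Spark itself; this sketch only wraps the write
path from user code, and the names (write_with_contract, ContractViolation)
are invented for illustration.

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

class ContractViolation(Exception):
    pass

def write_with_contract(df: DataFrame, constraint, path: str) -> None:
    """Persist df only if every row satisfies the boolean Column `constraint`."""
    if df.filter(~constraint).limit(1).count() > 0:
        raise ContractViolation(f"rows violate contract: {constraint}")
    df.write.mode("overwrite").parquet(path)

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[2]").getOrCreate()
    good = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
    write_with_contract(good, F.col("id") > 0, "/tmp/contracted")  # succeeds
    bad = spark.createDataFrame([(-1, "x")], ["id", "name"])
    try:
        write_with_contract(bad, F.col("id") > 0, "/tmp/contracted")
    except ContractViolation as e:
        print("rejected:", e)
    spark.stop()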




On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma  wrote:

> It can be as simple as adding a function to the Spark session builder,
> specifically on the read path, which can take the YAML file (assuming data
> contracts are defined in YAML) and apply it to the DataFrame.
> It can ignore the rows not matching the data contracts defined in the
> YAML.
>
> Thanks
> Deepak
>
> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry 
> wrote:
>
>> For my part, I'm not too concerned about the mechanism used to implement
>> the validation as long as it's rich enough to express the constraints.
>>
>> I took a look at JSON Schema (for which there are a number of JVM
>> implementations) but I don't think it can handle more complex data types
>> like dates. Maybe Elliot can comment on this?
>>
>> Ideally, *any* reasonable mechanism could be plugged in.
>>
>> But what struck me from trying to write a Proof of Concept was that it
>> was quite hard to inject my code into this particular area of the Spark
>> machinery. It could very well be due to my limited understanding of the
>> codebase, but it seemed the Spark code would need a bit of a refactor
>> before a component could be injected. Maybe people in this forum with
>> greater knowledge in this area could comment?
>>
>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears
>> to be attempting to implement data contracts within their ecosystem.
>> Unfortunately, I think it's closed source and Python only.
>>
>> Regards,
>>
>> Phillip
>>
>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It would be interesting to think about creating a contract validation
>>> library written in JSON format. This would ensure a validation mechanism
>>> that will rely on this library and could be shared among relevant parties.
>>> Will that be a starting point?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin  wrote:
>>>
 Hi,

 While I was at PayPal, we open-sourced a data contract template; it
 is here: https://github.com/paypal/data-contract-template. Companies
 like GX (Great Expectations) are interested in using it.

 Spark could read some elements from it pretty easily, like schema
 validation and some rule validations. Spark could also generate an embryo of
 data contracts…

 —jgp


 On Jun 13, 2023, at 07:25, Mich Talebzadeh 
 wrote:

 From my limited understanding of data contracts, there are two factors
 that seem necessary.


1. procedure matter
2. technical matter

 I mean this is nothing new. Some tools like Cloud data fusion can
 assist when the procedures are validated. Simply "The process of
 integrating multiple data sources to produce more consistent, accurate, and
 useful information than that provided by any individual data source.". In
 the old days, we had staging tables that were used to clean and prune data
 from multiple sources. Nowadays we use the so-called Integration layer. If
 you use Spark as an ETL tool, then you have to build this validation
 yourself. Case in point, how to map customer_id from one source to
 customer_no from another. Legacy systems are full of these anomalies. MDM
 can help but requires human intervention which is time consuming. I am not
 sure the role of Spark here except being able to read the mapping tables.

 HTH

 Mich Talebzadeh,
 Lead Solutions Architect/Engineering Lead
 Palantir Technologies Limited
 London
 United Kingdom

view my Linkedin profile
 


  

Re: Data Contracts

2023-06-19 Thread Deepak Sharma
It can be as simple as adding a function to the Spark session builder,
specifically on the read path, which can take the YAML file (assuming data
contracts are defined in YAML) and apply it to the DataFrame.
It can ignore the rows not matching the data contracts defined in the YAML.

Thanks
Deepak
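
A minimal sketch of that suggestion, assuming PyYAML is available and using a
made-up YAML layout (the actual contract format would still need to be
agreed):

import yaml  # PyYAML, assumed available
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

CONTRACT_YAML = """
columns:
  - name: customer_id
    not_null: true
  - name: age
    min: 0
    max: 130
"""

def apply_contract(df: DataFrame, contract_yaml: str) -> DataFrame:
    """Drop rows that do not satisfy the constraints declared in the YAML."""
    contract = yaml.safe_load(contract_yaml)
    for column in contract["columns"]:
        c = F.col(column["name"])
        if column.get("not_null"):
            df = df.filter(c.isNotNull())
        if "min" in column:
            df = df.filter(c >= column["min"])
        if "max" in column:
            df = df.filter(c <= column["max"])
    return df

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame(
        [(1, 34), (2, None), (None, 25), (3, 999)], ["customer_id", "age"]
    )
    apply_contract(df, CONTRACT_YAML).show()  # keeps only (1, 34)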

On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry 
wrote:

> For my part, I'm not too concerned about the mechanism used to implement
> the validation as long as it's rich enough to express the constraints.
>
> I took a look at JSON Schema (for which there are a number of JVM
> implementations) but I don't think it can handle more complex data types
> like dates. Maybe Elliot can comment on this?
>
> Ideally, *any* reasonable mechanism could be plugged in.
>
> But what struck me from trying to write a Proof of Concept was that it was
> quite hard to inject my code into this particular area of the Spark
> machinery. It could very well be due to my limited understanding of the
> codebase, but it seemed the Spark code would need a bit of a refactor
> before a component could be injected. Maybe people in this forum with
> greater knowledge in this area could comment?
>
> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears
> to be attempting to implement data contracts within their ecosystem.
> Unfortunately, I think it's closed source and Python only.
>
> Regards,
>
> Phillip
>
> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> It would be interesting to think about creating a contract validation
>> library written in JSON format. This would ensure a validation mechanism
>> that will rely on this library and could be shared among relevant parties.
>> Will that be a starting point?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin  wrote:
>>
>>> Hi,
>>>
>>> While I was at PayPal, we open-sourced a data contract template; it
>>> is here: https://github.com/paypal/data-contract-template. Companies
>>> like GX (Great Expectations) are interested in using it.
>>>
>>> Spark could read some elements from it pretty easily, like schema
>>> validation and some rule validations. Spark could also generate an embryo of
>>> data contracts…
>>>
>>> —jgp
>>>
>>>
>>> On Jun 13, 2023, at 07:25, Mich Talebzadeh 
>>> wrote:
>>>
>>> From my limited understanding of data contracts, there are two factors
>>> that seem necessary.
>>>
>>>
>>>1. procedure matter
>>>2. technical matter
>>>
>>> I mean this is nothing new. Some tools like Cloud data fusion can assist
>>> when the procedures are validated. Simply "The process of integrating
>>> multiple data sources to produce more consistent, accurate, and useful
>>> information than that provided by any individual data source.". In the old
>>> days, we had staging tables that were used to clean and prune data from
>>> multiple sources. Nowadays we use the so-called Integration layer. If you
>>> use Spark as an ETL tool, then you have to build this validation yourself.
>>> Case in point, how to map customer_id from one source to customer_no from
>>> another. Legacy systems are full of these anomalies. MDM can help but
>>> requires human intervention which is time consuming. I am not sure the role
>>> of Spark here except being able to read the mapping tables.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry 
>>> wrote:
>>>
 Hi, Fokko and Deepak.

 The problem with DBT and Great Expectations (and Soda too, I believe)
 is that by the time they find the problem, the error is already in
 production - and fixing production can be a nightmare.

 What's more, we've found that nobody ever looks 

Re: Data Contracts

2023-06-19 Thread Phillip Henry
For my part, I'm not too concerned about the mechanism used to implement
the validation as long as it's rich enough to express the constraints.

I took a look at JSON Schema (for which there are a number of JVM
implementations) but I don't think it can handle more complex data types
like dates. Maybe Elliot can comment on this?

Ideally, *any* reasonable mechanism could be plugged in.

But what struck me from trying to write a Proof of Concept was that it was
quite hard to inject my code into this particular area of the Spark
machinery. It could very well be due to my limited understanding of the
codebase, but it seemed the Spark code would need a bit of a refactor
before a component could be injected. Maybe people in this forum with
greater knowledge in this area could comment?

BTW, it's interesting to see that Databricks' "Delta Live Tables" appears to
be attempting to implement data contracts within their ecosystem.
Unfortunately, I think it's closed source and Python only.

Regards,

Phillip
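
As one sketch of what "plugged in" could mean, a minimal validator protocol
follows. The names are invented; the date check illustrates the kind of
constraint that is awkward in JSON Schema, whose "format": "date" keyword is
only enforced when format checking is explicitly enabled.

from datetime import date
from typing import List, Protocol

class RowValidator(Protocol):
    def validate(self, row: dict) -> List[str]:
        """Return violation messages; an empty list means the row conforms."""
        ...

class DateRangeValidator:
    """A hand-written check of the kind JSON Schema struggles to express."""

    def __init__(self, column: str, earliest: date):
        self.column = column
        self.earliest = earliest

    def validate(self, row: dict) -> List[str]:
        value = row.get(self.column)
        if not isinstance(value, date) or value < self.earliest:
            return [f"{self.column}={value!r} is not a date on/after {self.earliest}"]
        return []

if __name__ == "__main__":
    v: RowValidator = DateRangeValidator("created", earliest=date(2020, 1, 1))
    print(v.validate({"created": date(2023, 6, 19)}))  # []
    print(v.validate({"created": "2023-06-19"}))       # violation: string, not a date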

On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh 
wrote:

> It would be interesting to think about creating a contract validation
> library written in JSON format. This would ensure a validation mechanism
> that will rely on this library and could be shared among relevant parties.
> Will that be a starting point?
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin  wrote:
>
>> Hi,
>>
>> While I was at PayPal, we open-sourced a data contract template; it is
>> here: https://github.com/paypal/data-contract-template. Companies like
>> GX (Great Expectations) are interested in using it.
>>
>> Spark could read some elements from it pretty easily, like schema
>> validation and some rule validations. Spark could also generate an embryo of
>> data contracts…
>>
>> —jgp
>>
>>
>> On Jun 13, 2023, at 07:25, Mich Talebzadeh 
>> wrote:
>>
>> From my limited understanding of data contracts, there are two factors
>> that seem necessary.
>>
>>
>>1. procedure matter
>>2. technical matter
>>
>> I mean this is nothing new. Some tools like Cloud data fusion can assist
>> when the procedures are validated. Simply "The process of integrating
>> multiple data sources to produce more consistent, accurate, and useful
>> information than that provided by any individual data source.". In the old
>> days, we had staging tables that were used to clean and prune data from
>> multiple sources. Nowadays we use the so-called Integration layer. If you
>> use Spark as an ETL tool, then you have to build this validation yourself.
>> Case in point, how to map customer_id from one source to customer_no from
>> another. Legacy systems are full of these anomalies. MDM can help but
>> requires human intervention which is time consuming. I am not sure the role
>> of Spark here except being able to read the mapping tables.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry 
>> wrote:
>>
>>> Hi, Fokko and Deepak.
>>>
>>> The problem with DBT and Great Expectations (and Soda too, I believe) is
>>> that by the time they find the problem, the error is already in production
>>> - and fixing production can be a nightmare.
>>>
>>> What's more, we've found that nobody ever looks at the data quality
>>> reports we already generate.
>>>
>>> You can, of course, run DBT, GE etc as part of a CI/CD pipeline but it's
>>> usually against synthetic or at best sampled data (laws like GDPR generally
>>> stop personal information data being anywhere but prod).
>>>
>>> What I'm proposing is something that stops production data ever being
>>> tainted.
>>>
>>> Hi, Elliot.
>>>
>>> Nice to see you again (we worked together 20 years ago)!
>>>
>>> The problem here is that a schema itself won't protect me (at