Re: [DISCUSS] SPIP: Add PySpark Test Framework

2023-06-13 Thread Hyukjin Kwon
Yeah, I have been thinking about this too, and Holden did some work here
that this SPIP will reuse. I support this.

On Wed, 14 Jun 2023 at 08:10, Amanda Liu 
wrote:

> Hi all,
>
> I'd like to start a discussion about implementing an official PySpark test
> framework. Currently, there's no official test framework, but only various
> open-source repos and blog posts.
>
> Many of these open-source resources are very popular, which demonstrates
> user demand for PySpark testing capabilities. spark-testing-base has 1.4k
> stars, and chispa has 532k downloads/month. However, it can be confusing
> for users to piece together disparate resources to write their own PySpark
> tests (see The Elephant in the Room: How to Write PySpark Tests).
>
> We can streamline and simplify the testing process by incorporating test
> features, such as a PySpark Test Base class (which allows tests to share
> Spark sessions) and test util functions (for example, asserting dataframe
> and schema equality).
>
> Please see the SPIP document attached:
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
> and the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44042
>
> I would appreciate it if you could share your thoughts on this proposal.
>
> Thank you!
> Amanda Liu
>


[DISCUSS] SPIP: Add PySpark Test Framework

2023-06-13 Thread Amanda Liu
Hi all,

I'd like to start a discussion about implementing an official PySpark test
framework. Currently, there's no official test framework, but only various
open-source repos and blog posts.

Many of these open-source resources are very popular, which demonstrates
user demand for PySpark testing capabilities. spark-testing-base has 1.4k
stars, and chispa has 532k downloads/month. However, it can be confusing
for users to piece together disparate resources to write their own PySpark
tests (see The Elephant in the Room: How to Write PySpark Tests).

We can streamline and simplify the testing process by incorporating test
features, such as a PySpark Test Base class (which allows tests to share
Spark sessions) and test util functions (for example, asserting dataframe
and schema equality).
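
To make this concrete, here is a minimal sketch of what a shared-session
base class and an equality util could look like; the class and function
names below are purely illustrative, not the proposed API:

    import unittest

    from pyspark.sql import SparkSession


    class PySparkTestBase(unittest.TestCase):
        """Illustrative base class: one SparkSession shared by every test in the class."""

        @classmethod
        def setUpClass(cls):
            cls.spark = (SparkSession.builder
                         .master("local[2]")
                         .appName("pyspark-tests")
                         .getOrCreate())

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()


    def assert_dataframe_equal(actual, expected):
        """Illustrative util: compare schemas first, then rows (order-insensitive)."""
        assert actual.schema == expected.schema, "Schemas differ"
        assert sorted(actual.collect()) == sorted(expected.collect()), "Rows differ"


    class UpperCaseTest(PySparkTestBase):
        def test_upper(self):
            from pyspark.sql import functions as F
            df = self.spark.createDataFrame([("a",), ("b",)], ["letter"])
            result = df.select(F.upper("letter").alias("letter"))
            expected = self.spark.createDataFrame([("A",), ("B",)], ["letter"])
            assert_dataframe_equal(result, expected)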

Please see the SPIP document attached:
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
and the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44042

I would appreciate it if you could share your thoughts on this proposal.

Thank you!
Amanda Liu


Re: Apache Spark 4.0 Timeframe?

2023-06-13 Thread Dongjoon Hyun
It's great to hear from you that you are interested in the discussion of
Apache Spark 4.0 scope and timeframe. :)

This is the initial thread, and upgrading the default Scala version to 2.13
or 3.3 is one of the reasons for it.

As we know, Apache Spark 4.0 is not limited to this. For example, dropping
Java 8 is also discussed in "JDK version support policy?" thread separately.

In principle, this thread aims to keep the Apache Spark community from being
blocked by blanket answers of "No": (1) "No until Apache Spark 4.0" and
(2) "No 4.0 because we don't have much".

So, could you share your opinion for Apache Spark 4.0 here?

Best,
Dongjoon.


On Wed, May 31, 2023 at 6:02 PM Dongjoon Hyun  wrote:

> Hi, All.
>
> I'd like to propose to start to prepare Apache Spark 4.0 after creating
> branch-3.5 on July 16th.
>
> - https://spark.apache.org/versioning-policy.html
>
> Historically, Apache Spark releases have had the following timeframes,
> and we already have a Spark 3.5 plan which will be maintained up to 2026.
>
> Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
> Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
> Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
> Spark 4: 2024.06 (4.0.0, NEW)
>
> As we discussed in the previous email thread, `Apache Spark 3.5.0
> Expectations`, we cannot deliver some features without Apache Spark 4.
>
> - "I wonder if it’s safer to do it in Spark 4 (which I believe will be
> discussed soon)."
> - "I would make it the default at 4.0, myself."
>
> Although there are other features as well, let's focus on the Scala
> language support history.
>
> Spark 2.0: SPARK-6363 Make Scala 2.11 the default Scala version (2016.07)
> Spark 3.0: SPARK-25956 Make Scala 2.12 as default Scala version in Spark
> 3.0 (2020.06)
>
> In addition, the Scala community released Scala 3.3.0 LTS yesterday.
>
> - https://scala-lang.org/blog/2023/05/30/scala-3.3.0-released.html
>
> If we decide to start, I believe we can support Scala 2.13 or Scala 3.3
> next year with Apache Spark 4 while supporting Spark 3.4 and 3.5 for Scala
> 2.12 users.
>
> WDYT?
>
> Thanks,
> Dongjoon.
>


Re: JDK version support policy?

2023-06-13 Thread David Li
Thanks all for the discussion here. Based on this I think we'll stick with Java 
8 for now and then upgrade to Java 11 around or after Spark 4.

On Thu, Jun 8, 2023, at 07:17, Sean Owen wrote:
> Noted, but for that you'd simply run your app on Java 17. If Spark works, and 
> your app's dependencies work on Java 17 because you compile it for 17 (and 
> jakarta.* classes for example) then there's no issue.
> 
> On Thu, Jun 8, 2023 at 3:13 AM Martin Andersson  
> wrote:
>> There are some reasons to drop Java 11 as well. Java 17 included a large
>> change, breaking backwards compatibility with their transition from Java EE
>> to Jakarta EE. This means that any users using Spark 4.0 together with
>> Spring 6.x or any recent version of servlet containers such as Tomcat or
>> Jetty will experience issues. (For security reasons it's beneficial to
>> float your dependencies to the latest version of these
>> libraries/frameworks.)
>> 
>> I'm not explicitly saying Java 11 should be dropped in Spark 4, just thought 
>> I'd bring this issue to your attention.
>> 
>> Best Regards, Martin
>> 
>> 
>> *From:* Jungtaek Lim 
>> *Sent:* Wednesday, June 7, 2023 23:19
>> *To:* Sean Owen 
>> *Cc:* Dongjoon Hyun ; Holden Karau 
>> ; dev 
>> *Subject:* Re: JDK version support policy?
>>  
>> 
>> 
>> +1 to drop Java 8 but +1 to set the lowest support version to Java 11.
>> 
>> Considering the security-updates-only phase, JDK 11 LTS will not be EOLed
>> for a very long time. Unless dropping it is coupled with other dependencies
>> that require bumping the JDK version (hopefully someone can bring up a
>> list), it doesn't seem to buy much. And given the strong backward
>> compatibility the JDK provides, that's less likely.
>> 
>> Purely from the project's source code point of view, does anyone know how
>> much benefit we would get from picking 17 rather than 11? I've lost track,
>> but many of the newer JDK proposals are about catching up with other
>> languages, which doesn't help us much since Scala has provided those
>> features for years.
>> 
>> On Thu, Jun 8, 2023 at 2:35 AM, Sean Owen wrote:
>>> I also generally perceive that, after Java 9, there is much less breaking 
>>> change. So working on Java 11 probably means it works on 20, or can be 
>>> easily made to without pain. Like I think the tweaks for Java 17 were quite 
>>> small. 
>>> 
>>> Targeting Java >11 excludes Java 11 users and probably wouldn't buy much. 
>>> Keeping the support probably doesn't interfere with working on much newer 
>>> JVMs either. 
>>> 
>>> On Wed, Jun 7, 2023, 12:29 PM Holden Karau  wrote:
 So JDK 11 is still supported in OpenJDK until 2026. I'm not sure we're
 going to see enough folks moving to JRE 17 by the Spark 4 release, so unless
 we have a strong benefit from dropping 11 support I'd be inclined to keep it.
 
 On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:
> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
> 
> Dongjoon.
> 
> On 2023/06/07 02:42:19 yangjie01 wrote:
> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only 
> > support Java 17 and the upcoming Java 21.
> > 
> > From: Denny Lee 
> > Date: Wednesday, June 7, 2023 07:10
> > To: Sean Owen 
> > Cc: David Li , "dev@spark.apache.org" 
> > 
> > Subject: Re: JDK version support policy?
> > 
> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the 
> > fast-paced (positive) updates to Arrow, eh?!
> > 
> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen <sro...@gmail.com> wrote:
> > I haven't followed this discussion closely, but I think we could/should 
> > drop Java 8 in Spark 4.0, which is up next after 3.5?
> > 
> > On Tue, Jun 6, 2023 at 2:44 PM David Li <lidav...@apache.org> wrote:
> > Hello Spark developers,
> > 
> > I'm from the Apache Arrow project. We've discussed Java version support 
> > [1], and crucially, whether to continue supporting Java 8 or not. As 
> > Spark is a big user of Arrow in Java, I was curious what Spark's policy 
> > here was.
> > 
> > If Spark intends to stay on Java 8, for instance, we may also want to 
> > stay on Java 8 or otherwise provide some supported version of Arrow for 
> > Java 8.
> > 
> > We've seen dependencies dropping or planning to drop support. gRPC may 
> > drop Java 8 at any time [2], possibly this September [3], which may 
> > affect Spark (due to Spark Connect). And today we saw that Arrow had 
> > issues running tests with Mockito on Java 20, but we couldn't update 
> > Mockito since it had dropped Java 8 support. (We pinned the JDK version 
> > in that CI pipeline for 

Re: Data Contracts

2023-06-13 Thread Mich Talebzadeh
From my limited understanding of data contracts, there are two factors that
seem necessary:

   1. procedural matters
   2. technical matters

I mean, this is nothing new. Some tools like Cloud Data Fusion can assist
once the procedures are validated: simply "the process of integrating
multiple data sources to produce more consistent, accurate, and useful
information than that provided by any individual data source". In the old
days, we had staging tables that were used to clean and prune data from
multiple sources. Nowadays we use the so-called integration layer. If you
use Spark as an ETL tool, then you have to build this validation yourself.
Case in point: how to map customer_id from one source to customer_no from
another. Legacy systems are full of these anomalies. MDM can help, but it
requires human intervention, which is time consuming. I am not sure what
the role of Spark is here, other than being able to read the mapping tables.
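
If Spark were used for that validation step, a rough sketch of the kind of
mapping check involved (all table and column names below are hypothetical)
might be:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    # Hypothetical source system and mapping table
    source = spark.createDataFrame(
        [(1, "Alice"), (2, "Bob"), (3, "Carol")], ["customer_id", "name"])
    mapping = spark.createDataFrame(
        [(1, "C-001"), (2, "C-002")], ["customer_id", "customer_no"])

    # customer_ids with no corresponding customer_no need human intervention
    unmapped = source.join(mapping, on="customer_id", how="left_anti")
    unmapped.show()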

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 13 Jun 2023 at 10:01, Phillip Henry  wrote:

> Hi, Fokko and Deepak.
>
> The problem with DBT and Great Expectations (and Soda too, I believe) is
> that by the time they find the problem, the error is already in production
> - and fixing production can be a nightmare.
>
> What's more, we've found that nobody ever looks at the data quality
> reports we already generate.
>
> You can, of course, run DBT, GT etc as part of a CI/CD pipeline but it's
> usually against synthetic or at best sampled data (laws like GDPR generally
> stop personal information data being anywhere but prod).
>
> What I'm proposing is something that stops production data ever being
> tainted.
>
> Hi, Elliot.
>
> Nice to see you again (we worked together 20 years ago)!
>
> The problem here is that a schema itself won't protect me (at least as I
> understand your argument). For instance, I have medical records that say
> some of my patients are 999 years old which is clearly ridiculous but their
> age correctly conforms to an integer data type. I have other patients who
> were discharged *before* they were admitted to hospital. I have 28
> patients out of literally millions who recently attended hospital but were
> discharged on 1/1/1900. As you can imagine, this made the average length of
> stay (a key metric for acute hospitals) much lower than it should have
> been. It only came to light when some average length of stays were
> negative!
>
> In all these cases, the data faithfully adhered to the schema.
>
> Hi, Ryan.
>
> This is an interesting point. There *should* indeed be a human connection
> but often there isn't. For instance, I have a friend who complained that
> his company's Zurich office made a breaking change and was not even aware
> that his London based department existed, never mind depended on their
> data. In large organisations, this is pretty common.
>
> TBH, my proposal doesn't address this particular use case (maybe hooks and
> metastore listeners would...?) But my point remains that although these
> relationships should exist, in a sufficiently large organisation, they
> generally don't. And maybe we can help fix that with code?
>
> Would love to hear further thoughts.
>
> Regards,
>
> Phillip
>
>
>
>
>
> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong  wrote:
>
>> Hey Phillip,
>>
>> Thanks for raising this. I like the idea. The question is, should this be
>> implemented in Spark or some other framework? I know that dbt has a fairly
>> extensive way of testing your data, and making sure that you
>> can enforce assumptions on the columns. The nice thing about dbt is that it
>> is built from a software engineering perspective, so all the tests (or
>> contracts) are living in version control. Using pull requests you could
>> collaborate on changing the contract and making sure that the change has
>> gotten enough attention before pushing it to production. Hope this helps!
>>
>> Kind regards,
>> Fokko
>>
>> On Tue, 13 Jun 2023 at 04:31, Deepak Sharma wrote:
>>
>>> Spark can be used with tools like Great Expectations as well to
>>> implement data contracts.
>>> I am not sure, though, if Spark alone can do data contracts.
>>> I was reading a blog on data mesh and how to glue it together with data
>>> contracts; that's where I came across this Spark and Great Expectations
>>> mention.
>>>
>>> HTH
>>>
>>> -Deepak
>>>
>>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot 

Re: Data Contracts

2023-06-13 Thread Phillip Henry
Hi, Fokko and Deepak.

The problem with DBT and Great Expectations (and Soda too, I believe) is
that by the time they find the problem, the error is already in production
- and fixing production can be a nightmare.

What's more, we've found that nobody ever looks at the data quality reports
we already generate.

You can, of course, run DBT, GT etc as part of a CI/CD pipeline but it's
usually against synthetic or at best sampled data (laws like GDPR generally
stop personal information data being anywhere but prod).

What I'm proposing is something that stops production data ever being
tainted.

Hi, Elliot.

Nice to see you again (we worked together 20 years ago)!

The problem here is that a schema itself won't protect me (at least as I
understand your argument). For instance, I have medical records that say
some of my patients are 999 years old which is clearly ridiculous but their
age correctly conforms to an integer data type. I have other patients who
were discharged *before* they were admitted to hospital. I have 28 patients
out of literally millions who recently attended hospital but were
discharged on 1/1/1900. As you can imagine, this made the average length of
stay (a key metric for acute hospitals) much lower than it should have
been. It only came to light when some average length of stays were
negative!

In all these cases, the data faithfully adhered to the schema.
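
Purely by way of illustration (the column names and thresholds are
hypothetical), a rule-based check in PySpark that catches exactly this kind
of schema-valid-but-wrong data might look like:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    records = spark.createDataFrame(
        [(101,  42, "2023-01-10", "2023-01-12"),   # fine
         (102, 999, "2023-01-05", "2023-01-07"),   # impossible age
         (103,  58, "2023-01-05", "2023-01-04")],  # discharged before admitted
        ["patient_id", "age", "admitted", "discharged"])

    # Every row satisfies the schema; only rule-based checks catch the bad ones.
    violations = records.where(
        (F.col("age") > 120) |
        (F.to_date("discharged") < F.to_date("admitted")))
    violations.show()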

Hi, Ryan.

This is an interesting point. There *should* indeed be a human connection
but often there isn't. For instance, I have a friend who complained that
his company's Zurich office made a breaking change and was not even aware
that his London based department existed, never mind depended on their
data. In large organisations, this is pretty common.

TBH, my proposal doesn't address this particular use case (maybe hooks and
metastore listeners would...?) But my point remains that although these
relationships should exist, in a sufficiently large organisation, they
generally don't. And maybe we can help fix that with code?

Would love to hear further thoughts.

Regards,

Phillip





On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong  wrote:

> Hey Phillip,
>
> Thanks for raising this. I like the idea. The question is, should this be
> implemented in Spark or some other framework? I know that dbt has a fairly
> extensive way of testing your data, and making sure that you
> can enforce assumptions on the columns. The nice thing about dbt is that it
> is built from a software engineering perspective, so all the tests (or
> contracts) are living in version control. Using pull requests you could
> collaborate on changing the contract and making sure that the change has
> gotten enough attention before pushing it to production. Hope this helps!
>
> Kind regards,
> Fokko
>
> On Tue, 13 Jun 2023 at 04:31, Deepak Sharma wrote:
>
>> Spark can be used with tools like Great Expectations as well to implement
>> data contracts.
>> I am not sure, though, if Spark alone can do data contracts.
>> I was reading a blog on data mesh and how to glue it together with data
>> contracts; that's where I came across this Spark and Great Expectations
>> mention.
>>
>> HTH
>>
>> -Deepak
>>
>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West  wrote:
>>
>>> Hi Phillip,
>>>
>>> While not as fine-grained as your example, there do exist schema systems
>>> such as that in Avro that can evaluate compatible and incompatible
>>> changes to the schema, from the perspective of the reader, writer, or both.
>>> This provides some potential degree of enforcement, and means to
>>> communicate a contract. Interestingly I believe this approach has been
>>> applied to both JsonSchema and protobuf as part of the Confluent Schema
>>> registry.
>>>
>>> Elliot.
>>>
>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry 
>>> wrote:
>>>
 Hi, folks.

 There currently seems to be a buzz around "data contracts". From what I
 can tell, these mainly advocate a cultural solution. But instead, could big
 data tools be used to enforce these contracts?

 My questions really are: are there any plans to implement data
 constraints in Spark (eg, an integer must be between 0 and 100; the date in
 column X must be before that in column Y)? And if not, is there an appetite
 for them?

 Maybe we could associate constraints with schema metadata that are
 enforced in the implementation of a FileFormatDataWriter?
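
For what it's worth, a very rough sketch of that idea in PySpark, with the
check done in user code before the write rather than inside
FileFormatDataWriter, and with an entirely hypothetical metadata key:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, IntegerType

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    # Hypothetical contract: the constraint travels with the column metadata.
    schema = StructType([
        StructField("score", IntegerType(), False,
                    metadata={"constraint": "score BETWEEN 0 AND 100"}),
    ])
    df = spark.createDataFrame([(42,), (150,)], schema)

    # Enforce every declared constraint before anything is written.
    for field in df.schema.fields:
        constraint = field.metadata.get("constraint")
        if constraint and df.filter(~F.expr(constraint)).count() > 0:
            raise ValueError(f"Contract violated on {field.name}: {constraint}")

    df.write.mode("overwrite").parquet("/tmp/scores")  # not reached for the bad row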

 Just throwing it out there and wondering what other people think. It's
 an area that interests me as it seems that over half my problems at the day
 job are because of dodgy data.

 Regards,

 Phillip




Re: Data Contracts

2023-06-13 Thread Fokko Driesprong
Hey Phillip,

Thanks for raising this. I like the idea. The question is, should this be
implemented in Spark or some other framework? I know that dbt has a fairly
extensive way of testing your data, and making sure that you
can enforce assumptions on the columns. The nice thing about dbt is that it
is built from a software engineering perspective, so all the tests (or
contracts) are living in version control. Using pull requests you could
collaborate on changing the contract and making sure that the change has
gotten enough attention before pushing it to production. Hope this helps!

Kind regards,
Fokko

On Tue, 13 Jun 2023 at 04:31, Deepak Sharma wrote:

> Spark can be used with tools like Great Expectations as well to implement
> data contracts.
> I am not sure, though, if Spark alone can do data contracts.
> I was reading a blog on data mesh and how to glue it together with data
> contracts; that's where I came across this Spark and Great Expectations
> mention.
>
> HTH
>
> -Deepak
>
> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West  wrote:
>
>> Hi Phillip,
>>
>> While not as fine-grained as your example, there do exist schema systems
>> such as that in Avro that can evaluate compatible and incompatible
>> changes to the schema, from the perspective of the reader, writer, or both.
>> This provides some potential degree of enforcement, and means to
>> communicate a contract. Interestingly I believe this approach has been
>> applied to both JsonSchema and protobuf as part of the Confluent Schema
>> registry.
>>
>> Elliot.
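
Purely to illustrate the reader/writer idea Elliot describes above, here is
a toy check against Spark's own StructType; this is not the Avro or Schema
Registry implementation, and the compatibility rule is deliberately
simplified:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    def reader_compatible(reader: StructType, writer: StructType) -> bool:
        """Toy check: can data written with `writer` be read as `reader`?
        Every reader field must exist in the writer with the same type,
        or be nullable so missing values can be filled with nulls."""
        writer_fields = {f.name: f for f in writer.fields}
        for field in reader.fields:
            match = writer_fields.get(field.name)
            if match is None:
                if not field.nullable:
                    return False      # required field missing from the writer
            elif match.dataType != field.dataType:
                return False          # incompatible type change
        return True

    v1 = StructType([StructField("id", IntegerType(), False),
                     StructField("name", StringType(), True)])
    v2 = StructType([StructField("id", IntegerType(), False)])  # "name" dropped

    print(reader_compatible(reader=v1, writer=v2))  # True: "name" is nullable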
>>
>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry 
>> wrote:
>>
>>> Hi, folks.
>>>
>>> There currently seems to be a buzz around "data contracts". From what I
>>> can tell, these mainly advocate a cultural solution. But instead, could big
>>> data tools be used to enforce these contracts?
>>>
>>> My questions really are: are there any plans to implement data
>>> constraints in Spark (eg, an integer must be between 0 and 100; the date in
>>> column X must be before that in column Y)? And if not, is there an appetite
>>> for them?
>>>
>>> Maybe we could associate constraints with schema metadata that are
>>> enforced in the implementation of a FileFormatDataWriter?
>>>
>>> Just throwing it out there and wondering what other people think. It's
>>> an area that interests me as it seems that over half my problems at the day
>>> job are because of dodgy data.
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>>