Re: [Reminder] Spark 3.5 Branch Cut

2023-07-16 Thread Herman van Hovell
Hi Yuanjian,

For the ongoing encoder work for the Connect Scala client, I'd like to get
the following tickets in:

   - SPARK-44396: Direct Arrow Deserialization
   - SPARK-9: Upcasting for Arrow Deserialization
   - SPARK-44450: Make direct Arrow encoding work with SQL/API.

Cheers,
Herman

On Sat, Jul 15, 2023 at 7:53 AM Enrico Minack wrote:

> Speaking of JdbcDialect, is there any interest in getting upserts for JDBC
> into 3.5.0?
>
> [SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC:
> https://github.com/apache/spark/pull/41518
> [SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC using
> MERGE INTO with temp table: https://github.com/apache/spark/pull/41611
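>
> For readers following along, the second PR's temp-table approach looks
> roughly like this in user code (a sketch only; the helper name, column
> names, and MERGE syntax are illustrative, not the PRs' actual API):
>
>   // Stage the DataFrame in a temporary JDBC table, then MERGE INTO the target.
>   def upsertViaStaging(df: org.apache.spark.sql.DataFrame,
>                        url: String, table: String): Unit = {
>     val staging = table + "_staging"
>     df.write.format("jdbc")
>       .option("url", url).option("dbtable", staging)
>       .mode("overwrite").save()
>     // MERGE syntax varies per database; this is the generic SQL:2003 form.
>     val merge =
>       s"""MERGE INTO $table t USING $staging s ON t.id = s.id
>          |WHEN MATCHED THEN UPDATE SET value = s.value
>          |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)""".stripMargin
>     val conn = java.sql.DriverManager.getConnection(url)
>     try conn.createStatement().execute(merge) finally conn.close()
>   }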
>
> Enrico
>
>
> On 15.07.23 at 04:10, Jia Fan wrote:
>
> Can we put [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to
> JdbcDialect into 3.5.0?
> https://github.com/apache/spark/pull/41855
> Since this is the last major version update of 3.x, I think we need to
> make sure JdbcDialect can support more databases.
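>
> As an illustration of the extension point being discussed, a custom dialect
> sketch (the dropTable/getInsertStatement signatures below are guesses from
> the PR title, not the merged API):
>
>   import org.apache.spark.sql.jdbc.JdbcDialect
>
>   object MyDatabaseDialect extends JdbcDialect {
>     override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")
>     // Hypothetical hooks in the spirit of SPARK-44262:
>     def dropTable(table: String): String = s"DROP TABLE $table"
>     def getInsertStatement(table: String, columns: Seq[String]): String =
>       s"INSERT INTO $table (${columns.mkString(", ")}) " +
>         s"VALUES (${columns.map(_ => "?").mkString(", ")})"
>   }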
>
>
> On Sat, Jul 15, 2023 at 05:20, Gengliang Wang wrote:
>
>> Hi Yuanjian,
>>
>> Besides the abovementioned changes, it would be great to include the UI
>> page for Spark Connect: SPARK-44394.
>>
>> Best Regards,
>> Gengliang
>>
>> On Fri, Jul 14, 2023 at 11:44 AM Julek Sompolski wrote:
>>
>>> Thank you,
>>> My changes that you listed are tracked under this Epic:
>>> https://issues.apache.org/jira/browse/SPARK-43754
>>> I am also working on https://issues.apache.org/jira/browse/SPARK-44422;
>>> I didn't mention it before because I hope that it will make it
>>> before the cut.
>>>
>>> (Unrelated) My colleague is also working on
>>> https://issues.apache.org/jira/browse/SPARK-43923 and I am reviewing
>>> https://github.com/apache/spark/pull/41443, so I hope that that one
>>> will also make it before the cut.
>>>
>>> Best regards,
>>> Juliusz Sompolski
>>>
>>> On Fri, Jul 14, 2023 at 7:34 PM Yuanjian Li wrote:
>>>
 Hi everyone,
 As discussed earlier in "Time for Spark v3.5.0 release", I will cut
 branch-3.5 on *Monday, July 17th at 1 pm PST* as scheduled.

 Please plan your PR merge accordingly with the given timeline.
 Currently, we have received the following exception merge requests:

   - SPARK-44421: Reattach to existing execute in Spark Connect (server mechanism)
   - SPARK-44423: Reattach to existing execute in Spark Connect (scala client)
   - SPARK-44424: Reattach to existing execute in Spark Connect (python client)

 If there are any other exception feature requests, please reply to this
 email. We will not merge any new features in 3.5 after the branch cut.

 Best,
 Yuanjian

>>>
>


Re: Data Contracts

2023-07-16 Thread Phillip Henry
No worries. Have you had a chance to look at it?

Since this thread has gone dead, I assume there is no appetite for adding
data contract functionality?

Regards,

Phillip


On Mon, 19 Jun 2023, 11:23 Deepak Sharma wrote:

> Sorry for using "simple" in my last email.
> It's not going to be simple in any terms.
> Thanks for sharing the Git repo, Phillip.
> Will definitely go through it.
>
> Thanks
> Deepak
>
> On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry wrote:
>
>> I think it might be a bit more complicated than this (but happy to be
>> proved wrong).
>>
>> I have a minimum working example at:
>>
>> https://github.com/PhillHenry/SparkConstraints.git
>>
>> that runs out-of-the-box (mvn test) and demonstrates what I am trying to
>> achieve.
>>
>> A test persists a DataFrame that conforms to the contract, and
>> demonstrates that one that does not conform throws an Exception.
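>>
>> In spirit, the test does something like this (a paraphrase, not the
>> repository's actual code; the "id > 0" rule and column names are made up):
>>
>>   import spark.implicits._  // given an active SparkSession
>>
>>   val path = "/tmp/contract_demo"
>>   val good = Seq((1, "alice")).toDF("id", "name")
>>   good.write.mode("overwrite").parquet(path)     // contract holds: succeeds
>>
>>   val bad = Seq((-1, "bob")).toDF("id", "name")  // violates "id > 0"
>>   intercept[Exception] {                         // ScalaTest assertion
>>     bad.write.mode("overwrite").parquet(path)    // contract check throws
>>   }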
>>
>> I've had to slightly modify 3 Spark files to add the data contract
>> functionality. If you can think of a more elegant solution, I'd be very
>> grateful.
>>
>> Regards,
>>
>> Phillip
>>
>>
>>
>>
>> On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma wrote:
>>
>>> It can be as simple as adding a function to the spark session builder,
>>> specifically on the read, which can take the yaml file (the data contract
>>> definition would be in yaml) and apply it to the data frame.
>>> It can ignore the rows not matching the data contracts defined in the
>>> yaml.
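>>>
>>> A minimal sketch of that idea in Scala (the Contract shape and the helper
>>> below are entirely hypothetical, and the YAML parsing is left out):
>>>
>>>   import org.apache.spark.sql.{DataFrame, functions => F}
>>>
>>>   // A toy contract that just lists columns which must be non-null.
>>>   case class Contract(notNull: Seq[String])
>>>
>>>   // Drop non-conforming rows instead of failing the job.
>>>   def applyContract(df: DataFrame, c: Contract): DataFrame =
>>>     c.notNull.foldLeft(df)((d, col) => d.filter(F.col(col).isNotNull))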
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry wrote:
>>>
 For my part, I'm not too concerned about the mechanism used to
 implement the validation as long as it's rich enough to express the
 constraints.

 I took a look at JSON Schemas (for which there are a number of JVM
 implementations), but I don't think they can handle more complex data types
 like dates. Maybe Elliot can comment on this?
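
 (For what it's worth, JSON Schema can only describe a date as a string
 carrying a "format" annotation, e.g. { "type": "string", "format": "date" },
 and validators are not required to enforce "format", so richer temporal
 constraints would have to live outside the schema.)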

 Ideally, *any* reasonable mechanism could be plugged in.

 But what struck me from trying to write a Proof of Concept was that it
 was quite hard to inject my code into this particular area of the Spark
 machinery. It could very well be due to my limited understanding of the
 codebase, but it seemed the Spark code would need a bit of a refactor
 before a component could be injected. Maybe people in this forum with
 greater knowledge in this area could comment?

 BTW, it's interesting to see that Databricks' "Delta Live Tables"
 appears to be attempting to implement data contracts within their ecosystem.
 Unfortunately, I think it's closed source and Python only.

 Regards,

 Phillip

> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> It would be interesting to think about creating a contract
> validation library written in JSON format. This would give us a validation
> mechanism that relies on this library and could be shared among the
> relevant parties. Would that be a starting point?
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
>
> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin wrote:
>
>> Hi,
>>
>> While I was at PayPal, we open-sourced a Data Contract template;
>> it is here: https://github.com/paypal/data-contract-template.
>> Companies like GX (Great Expectations) are interested in using it.
>>
>> Spark could read some elements from it pretty easily, like schema
>> validation and some rule validations. Spark could also generate an
>> embryo of data contracts…
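>>
>> For instance, a schema check against such a contract could be as small as
>> this sketch (the helper and field layout are illustrative, not the
>> template's actual structure):
>>
>>   import org.apache.spark.sql.DataFrame
>>
>>   // Fail fast if the DataFrame lacks columns the contract declares.
>>   def checkSchema(df: DataFrame, declared: Set[String]): Unit = {
>>     val missing = declared -- df.columns.toSet
>>     require(missing.isEmpty, s"Contract violation; missing columns: $missing")
>>   }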
>>
>> —jgp
>>
>>
>> On Jun 13, 2023, at 07:25, Mich Talebzadeh wrote:
>>
>> From my limited understanding of data contracts, there are two
>> factors that seem necessary:
>>
>>    1. procedural matters
>>    2. technical matters
>>
>> I mean, this is nothing new. Some tools like Cloud Data Fusion can
>> assist when the procedures are validated. Simply: "The process of
>> integrating multiple data sources to produce more consistent, accurate,
>> and useful information than that provided by any individual data source."
>> In the old days, we had staging tables that were used to clean and prune
>> data