Re: Data Contracts

2023-06-14 Thread Jean-Georges Perrin
Hi,

While I was at PayPal, we open-sourced a data contract template; it is here: 
https://github.com/paypal/data-contract-template. Companies like GX (Great 
Expectations) are interested in using it.

Spark could read some elements from it pretty easily, like schema validation and 
some rule validations. Spark could also generate an embryo of a data contract…
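As an illustration only (the field names below are made up, not taken from the 
template, and parsing the contract file is left out), a schema check driven by a 
contract could be as simple as:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sketch: fail fast when a DataFrame does not match the schema declared in a
// contract. How `expected` is built from the contract file is out of scope here.
void enforceContractSchema(Dataset<Row> df, StructType expected) {
  StructType actual = df.schema();
  for (StructField field : expected.fields()) {
    StructField match = null;
    for (StructField f : actual.fields()) {
      if (f.name().equalsIgnoreCase(field.name())) {
        match = f;
      }
    }
    if (match == null) {
      throw new IllegalStateException("Contract violation: missing column " + field.name());
    }
    if (!match.dataType().equals(field.dataType())) {
      throw new IllegalStateException("Contract violation: column " + field.name()
          + " is " + match.dataType().simpleString()
          + ", contract says " + field.dataType().simpleString());
    }
  }
}

The rule validations could follow the same pattern, expressed as filters on the 
DataFrame.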

—jgp


> On Jun 13, 2023, at 07:25, Mich Talebzadeh  wrote:
> 
> From my limited understanding of data contracts, there are two factors that 
> are deemed necessary:
> 
> a procedural matter
> a technical matter
> I mean, this is nothing new. Some tools like Cloud Data Fusion can assist once 
> the procedures are validated. Simply put, it is "the process of integrating 
> multiple data sources to produce more consistent, accurate, and useful 
> information than that provided by any individual data source." In the old 
> days, we had staging tables that were used to clean and prune data from 
> multiple sources. Nowadays we use the so-called integration layer. If you use 
> Spark as an ETL tool, then you have to build this validation yourself. Case in 
> point: how to map customer_id from one source to customer_no from another. 
> Legacy systems are full of these anomalies. MDM can help, but it requires 
> human intervention, which is time-consuming. I am not sure what the role of 
> Spark is here, except being able to read the mapping tables.
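> As a rough sketch (table and column names are purely illustrative), reading 
> such a mapping table from Spark could look like:
> 
> import static org.apache.spark.sql.functions.broadcast;
> 
> // customers carries customer_id; the curated mapping table translates it
> // to the customer_no used by the other source.
> Dataset<Row> customers = spark.table("crm.customers");
> Dataset<Row> mapping = spark.table("mdm.customer_mapping");
> Dataset<Row> resolved = customers.join(broadcast(mapping), "customer_id");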
> 
> HTH
> 
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
> 
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Tue, 13 Jun 2023 at 10:01, Phillip Henry wrote:
>> Hi, Fokko and Deepak.
>> 
>> The problem with DBT and Great Expectations (and Soda too, I believe) is 
>> that by the time they find the problem, the error is already in production - 
>> and fixing production can be a nightmare. 
>> 
>> What's more, we've found that nobody ever looks at the data quality reports 
>> we already generate.
>> 
>> You can, of course, run DBT, GE, etc. as part of a CI/CD pipeline, but it's 
>> usually against synthetic or, at best, sampled data (laws like GDPR generally 
>> stop personal data from being anywhere but prod).
>> 
>> What I'm proposing is something that stops production data ever being 
>> tainted.
>> 
>> Hi, Elliot.
>> 
>> Nice to see you again (we worked together 20 years ago)!
>> 
>> The problem here is that a schema by itself won't protect me (at least as I 
>> understand your argument). For instance, I have medical records that say 
>> some of my patients are 999 years old, which is clearly ridiculous, but their 
>> age correctly conforms to an integer data type. I have other patients who 
>> were discharged before they were admitted to hospital. I have 28 patients 
>> out of literally millions who recently attended hospital but were discharged 
>> on 1/1/1900. As you can imagine, this made the average length of stay (a key 
>> metric for acute hospitals) much lower than it should have been. It only 
>> came to light when some average lengths of stay were negative! 
>> 
>> In all these cases, the data faithfully adhered to the schema.
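>> (Purely as an illustration, with made-up column names, the kind of rule that 
>> would have caught these could be run before any write is allowed:)
>> 
>> import static org.apache.spark.sql.functions.col;
>> 
>> // admissions is the DataFrame about to be written to production
>> long violations = admissions.filter(
>>     col("age").gt(120)
>>         .or(col("discharge_date").lt(col("admission_date")))
>>         .or(col("discharge_date").equalTo("1900-01-01")))
>>     .count();
>> if (violations > 0) {
>>   throw new IllegalStateException(violations + " rows break the contract; write rejected");
>> }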
>> 
>> Hi, Ryan.
>> 
>> This is an interesting point. There should indeed be a human connection but 
>> often there isn't. For instance, I have a friend who complained that his 
>> company's Zurich office made a breaking change and was not even aware that 
>> his London based department existed, never mind depended on their data. In 
>> large organisations, this is pretty common.
>> 
>> TBH, my proposal doesn't address this particular use case (maybe hooks and 
>> metastore listeners would...?) But my point remains that although these 
>> relationships should exist, in a sufficiently large organisation, they 
>> generally don't. And maybe we can help fix that with code?
>> 
>> Would love to hear further thoughts.
>> 
>> Regards,
>> 
>> Phillip
>> 
>> 
>> 
>> 
>> 
>> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong wrote:
>>> Hey Phillip,
>>> 
>>> Thanks for raising this. I like the idea. The question is, should this be 
>>> implemented in Spark or some other framework? I know that dbt has a fairly 
>>> extensive way of testing your data, and making sure that you 
>>> can enforce assumptions on the columns. The nice thing about dbt is that it 
>>> is built from a software engineering perspective, so all the tests (or 
>>> contracts) live in version control. Using pull 

Issues with Delta Lake on 3.0.0 preview + preview 2

2019-12-30 Thread Jean-Georges Perrin
Hi there,

Trying to run a very simple app saving content of a dataframe to Delta Lake. 
Code works great on 2.4.4 but fails on 3.0.0 preview & preview 2. Tried on both 
Delta Lake 0.5.0 and 0.4.0.

Code (I know, it’s amazing):

df.write().format("delta")
.mode("overwrite")
.save("/tmp/delta_grand_debat_events");

Exception raised:

Exception in thread "main" com.google.common.util.concurrent.ExecutionError: 
java.lang.NoSuchMethodError: 
org/apache/spark/util/Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class; 
(loaded from 
file:/Users/jgp/.m2/repository/org/apache/spark/spark-core_2.12/3.0.0-preview/spark-core_2.12-3.0.0-preview.jar
 by sun.misc.Launcher$AppClassLoader@7da46134) called from interface 
org.apache.spark.sql.delta.storage.LogStoreProvider (loaded from 
file:/Users/jgp/.m2/repository/io/delta/delta-core_2.12/0.5.0/delta-core_2.12-0.5.0.jar
 by sun.misc.Launcher$AppClassLoader@7da46134).
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2199)
at com.google.common.cache.LocalCache.get(LocalCache.java:3934)
at 
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4736)
at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:740)
at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:702)
at 
org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:126)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:71)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:69)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:87)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
at 
org.apache.spark.sql.execution.SparkPlan$$Lambda$1437.CC2CD020.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
at 
org.apache.spark.sql.execution.SparkPlan$$Lambda$1461.CFA16C20.apply(Unknown
 Source)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:109)
at 
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:829)
at 
org.apache.spark.sql.DataFrameWriter$$Lambda$2070.CFB38020.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
at 
org.apache.spark.sql.execution.SQLExecution$$$Lambda$1155.CF955820.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:829)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:309)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:236)
at 
net.jgp.books.spark.ch17.lab200_feed_delta.FeedDeltaLakeApp.start(FeedDeltaLakeApp.java:131)
at 
net.jgp.books.spark.ch17.lab200_feed_delta.FeedDeltaLakeApp.main(FeedDeltaLakeApp.java:29)
Caused by: java.lang.NoSuchMethodError: 
org/apache/spark/util/Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class; 
(loaded from 
file:/Users/jgp/.m2/repository/org/apache/spark/spark-core_2.12/3.0.0-preview/spark-core_2.12-3.0.0-preview.jar
 by sun.misc.Launcher$AppClassLoader@7da46134) called from interface 
org.apache.spark.sql.delta.storage.LogStoreProvider (loaded from 
file:/Users/jgp/.m2/repository/io/delta/delta-core_2.12/0.5.0/delta-core_2.12-0.5.0.jar
 by sun.misc.Launcher$AppClassLoader@7da46134).
at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:122)
at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:120)
at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:117)
at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:115)
at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
at 

Re: Issue with map Java lambda function with 3.0.0 preview and preview 2

2019-12-28 Thread Jean-Georges Perrin
Thanks Sean - yup, I was having issues with Scala 2.12 for some stuff, so I 
kept 2.11...

Casting works. It makes the code a little ugly, but… it’s definitely a Scala 2.12 
vs. 2.11 issue, not a Spark 3 one specifically.
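
For the record, the cast looks roughly like this (a trimmed-down sketch of the 
snippet below, with the counter logic omitted):

import org.apache.spark.api.java.function.MapFunction;

// Casting the lambda to MapFunction<Row, Integer> picks the Java-specific
// overload of map() and removes the ambiguity.
Dataset<Integer> dotsDs = incrementalDf
    .map((MapFunction<Row, Integer>) status -> {
      double x = Math.random() * 2 - 1;
      double y = Math.random() * 2 - 1;
      return (x * x + y * y <= 1) ? 1 : 0;
    }, Encoders.INT());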

jg

> On Dec 28, 2019, at 1:15 PM, Sean Owen  wrote:
> 
> Yes, it's necessary to cast the lambda in Java as (MapFunction)
> in many cases. This is because the Scala-specific and Java-specific
> versions of .map() both end up accepting a function object that the
> lambda can match, and an Encoder. What I'd have to go back and look up
> is why that would be different in Spark 3; some of that has always
> been the case with Java 8 in Spark 2. I think it might be related to
> Scala 2.12; were you using Spark 2 with Scala 2.11 before?
> 
> On Sat, Dec 28, 2019 at 11:38 AM Jean-Georges Perrin  wrote:
>> 
>> Hey guys,
>> 
>> This code:
>> 
>>   Dataset<Row> incrementalDf = spark
>>   .createDataset(l, Encoders.INT())
>>   .toDF();
>>   Dataset<Integer> dotsDs = incrementalDf
>>   .map(status -> {
>> double x = Math.random() * 2 - 1;
>> double y = Math.random() * 2 - 1;
>> counter++;
>> if (counter % 10 == 0) {
>>   System.out.println("" + counter + " darts thrown so far");
>> }
>> return (x * x + y * y <= 1) ? 1 : 0;
>>   }, Encoders.INT());
>> 
>> used to work with Spark 2.x; in the two previews, it says:
>> 
>> The method map(Function1, Encoder) is ambiguous for 
>> the type Dataset
>> 
>> If I define my mapping function as a class, it works fine. Here is the class:
>> 
>> private final class DartMapper
>> implements MapFunction<Row, Integer> {
>>   private static final long serialVersionUID = 38446L;
>> 
>>   @Override
>>   public Integer call(Row r) throws Exception {
>> double x = Math.random() * 2 - 1;
>> double y = Math.random() * 2 - 1;
>> counter++;
>> if (counter % 1000 == 0) {
>>   System.out.println("" + counter + " operations done so far");
>> }
>> return (x * x + y * y <= 1) ? 1 : 0;
>>   }
>> }
>> 
>> Any hint on what/if I did wrong?
>> 
>> jg
>> 
>> 
>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Issue with map Java lambda function with 3.0.0 preview and preview 2

2019-12-28 Thread Jean-Georges Perrin
I forgot… it does the same thing with the reducer…

int dartsInCircle = dotsDs.reduce((x, y) -> x + y);
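
Presumably the same kind of cast would sort out the reducer as well, something 
like (untested sketch):

import org.apache.spark.api.java.function.ReduceFunction;

// Casting to ReduceFunction<Integer> selects the Java-specific reduce() overload.
int dartsInCircle = dotsDs.reduce((ReduceFunction<Integer>) (x, y) -> x + y);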

jg

> On Dec 28, 2019, at 12:38 PM, Jean-Georges Perrin  wrote:
> 
> Hey guys,
> 
> This code:
> 
> Dataset<Row> incrementalDf = spark
> .createDataset(l, Encoders.INT())
> .toDF();
> Dataset<Integer> dotsDs = incrementalDf
> .map(status -> {
>   double x = Math.random() * 2 - 1;
>   double y = Math.random() * 2 - 1;
>   counter++;
>   if (counter % 10 == 0) {
> System.out.println("" + counter + " darts thrown so far");
>   }
>   return (x * x + y * y <= 1) ? 1 : 0;
> }, Encoders.INT());
> 
> used to work with Spark 2.x; in the two previews, it says:
> 
> The method map(Function1, Encoder) is ambiguous for the 
> type Dataset
> 
> If I define my mapping function as a class, it works fine. Here is the class:
> 
>   private final class DartMapper
>   implements MapFunction<Row, Integer> {
> private static final long serialVersionUID = 38446L;
> 
> @Override
> public Integer call(Row r) throws Exception {
>   double x = Math.random() * 2 - 1;
>   double y = Math.random() * 2 - 1;
>   counter++;
>   if (counter % 1000 == 0) {
> System.out.println("" + counter + " operations done so far");
>   }
>   return (x * x + y * y <= 1) ? 1 : 0;
> }
>   }
> 
> Any hint on what/if I did wrong? 
> 
> jg
> 
> 
> 



Issue with map Java lambda function with 3.0.0 preview and preview 2

2019-12-28 Thread Jean-Georges Perrin
Hey guys,

This code:

Dataset<Row> incrementalDf = spark
.createDataset(l, Encoders.INT())
.toDF();
Dataset<Integer> dotsDs = incrementalDf
.map(status -> {
  double x = Math.random() * 2 - 1;
  double y = Math.random() * 2 - 1;
  counter++;
  if (counter % 10 == 0) {
System.out.println("" + counter + " darts thrown so far");
  }
  return (x * x + y * y <= 1) ? 1 : 0;
}, Encoders.INT());

used to work with Spark 2.x; in the two previews, it says:

The method map(Function1, Encoder) is ambiguous for the 
type Dataset

If I define my mapping function as a class, it works fine. Here is the class:

  private final class DartMapper
  implements MapFunction<Row, Integer> {
private static final long serialVersionUID = 38446L;

@Override
public Integer call(Row r) throws Exception {
  double x = Math.random() * 2 - 1;
  double y = Math.random() * 2 - 1;
  counter++;
  if (counter % 1000 == 0) {
System.out.println("" + counter + " operations done so far");
  }
  return (x * x + y * y <= 1) ? 1 : 0;
}
  }

Any hint on what/if I did wrong? 

jg





Re: Thoughts on Spark 3 release, or a preview release

2019-09-11 Thread Jean Georges Perrin
As a user/non-committer, +1

I love the idea of an early 3.0.0 so we can test current dev against it. I know 
the final 3.x will probably need another round of testing when it gets out, but 
less for sure... I know I could check out and compile, but having a “packaged” 
preview version is great if it does not take too much time from the team...

jg


> On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:
> 
> +1 from me too but I would like to know what other people think too.
> 
> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun wrote:
>> Thank you, Sean.
>> 
>> I'm also +1 for the following three.
>> 
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>> 
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps it a 
>> lot.
>> 
>> After this discussion, can we have some timeline for `Spark 3.0 Release 
>> Window` in our versioning-policy page?
>> 
>> - https://spark.apache.org/versioning-policy.html
>> 
>> Bests,
>> Dongjoon.
>> 
>> 
>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer  wrote:
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility problems 
>>> resolved, e.g.
>>> 
>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> https://issues.apache.org/jira/browse/SPARK-27781
>>> 
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far as 
>>> I know, Parquet has not cut a release based on this new version.
>>> 
>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>> 
>>> https://github.com/apache/spark/pull/24851
>>> https://github.com/apache/spark/pull/24297
>>> 
>>>michael
>>> 
>>> 
 On Sep 11, 2019, at 1:37 PM, Sean Owen  wrote:
 
 I'm curious what current feelings are about ramping down towards a
 Spark 3 release. It feels close to ready. There is no fixed date,
 though in the past we had informally tossed around "back end of 2019".
 For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
 Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
 due.
 
 What are the few major items that must get done for Spark 3, in your
 opinion? Below are all of the open JIRAs for 3.0 (which everyone
 should feel free to update with things that aren't really needed for
 Spark 3; I already triaged some).
 
 For me, it's:
 - DSv2?
 - Finishing touches on the Hive, JDK 11 update
 
 What about considering a preview release earlier, as happened for
 Spark 2, to get feedback much earlier than the RC cycle? Could that
 even happen ... about now?
 
 I'm also wondering what a realistic estimate of Spark 3 release is. My
 guess is quite early 2020, from here.
 
 
 
 SPARK-29014 DataSourceV2: Clean up current, default, and session catalog 
 uses
 SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
 SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
 SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
 SPARK-28588 Build a SQL reference doc
 SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
 SPARK-28684 Hive module support JDK 11
 SPARK-28548 explain() shows wrong result for persisted DataFrames
 after some operations
 SPARK-28372 Document Spark WEB UI
 SPARK-28476 Support ALTER DATABASE SET LOCATION
 SPARK-28264 Revisiting Python / pandas UDF
 SPARK-28301 fix the behavior of table name resolution with multi-catalog
 SPARK-28155 do not leak SaveMode to file source v2
 SPARK-28103 Cannot infer filters from union table with empty local
 relation table properly
 SPARK-28024 Incorrect numeric values when out of range
 SPARK-27936 Support local dependency uploading from --py-files
 SPARK-27884 Deprecate Python 2 support in Spark 3.0
 SPARK-27763 Port test cases from PostgreSQL to Spark SQL
 SPARK-27780 Shuffle server & client should be versioned to enable
 smoother upgrade
 SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
 of joined tables > 12
 SPARK-27471 Reorganize public v2 catalog API
 SPARK-27520 Introduce a global config system to replace hadoopConfiguration
 SPARK-24625 put all the backward compatible behavior change configs
 under spark.sql.legacy.*
 SPARK-24640 size(null) returns null
 SPARK-24702 Unable to cast to calendar interval in spark sql.
 SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
 SPARK-24941 Add RDDBarrier.coalesce() function
 SPARK-25017 Add test suite for ContextBarrierState
 SPARK-25083 remove the type erasure hack in data source scan
 SPARK-25383 Image data source supports sample pushdown
 SPARK-27272 Enable blacklisting of node/executor on fetch failures by 
 default
 SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
 efficiency problem
 SPARK-25128 

Re: [DISCUSSION]JDK11 for Apache 2.x?

2019-08-27 Thread Jean Georges Perrin
Not a contributor, but a user perspective…

As Spark 3.x will be an evolution, I am not completely shocked that it would 
imply a Java 11 requirement as well. It would be great to have both Java 8 and 
Java 11, but one needs to be able to say goodbye. Java 8 is great, and we still 
use it actively in production, but we know its time is limited, so, by the time 
we evolve to Spark 3, we could combine the move with Java 11.

On the other hand, not everybody may think this way and it may slow down the 
adoption of Spark 3…

However, I concur with Sean, I don’t think another 2.x is needed for Java 11.

> On Aug 27, 2019, at 3:09 PM, Sean Owen  wrote:
> 
> I think one of the key problems here are the required dependency
> upgrades. It would mean many minor breaking changes and a few bigger
> ones, notably around Hive, and forces a scala 2.12-only update. I
> think my question is whether that even makes sense as a minor release?
> it wouldn't be backwards compatible with 2.4 enough to call it a
> low-risk update. It would be a smaller step than moving all the way to
> 3.0, sure. I am not super against it, but we have to keep in mind how
> much work it would then be to maintain two LTS 2.x releases, 2.4 and
> the sort-of-compatible 2.5, while proceeding with 3.x.
> 
> On Tue, Aug 27, 2019 at 2:01 PM DB Tsai  wrote:
>> 
>> Hello everyone,
>> 
>> Thank you all for working on supporting JDK11 in Apache Spark 3.0 as a 
>> community.
>> 
>> Java 8 is already end of life for commercial users, and many companies are 
>> moving to Java 11.
>> The release date for Apache Spark 3.0 is still not there yet, and there are 
>> many API
>> incompatibility issues when upgrading from Spark 2.x. As a result, asking 
>> users to move to
>> Spark 3.0 to use JDK 11 is not realistic.
>> 
>> Should we backport PRs for JDK11 and cut a release in 2.x to support JDK11?
>> 
>> Should we cut a new Apache Spark 2.5 since the patches involve some of the 
>> dependencies changes
>> which is not desired in minor release?
>> 
>> Thanks.
>> 
>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, 
>> Inc
>> 
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Spark in Action, 2e...

2019-07-10 Thread Jean-Georges Perrin
Hi Spark contributors,


As some of you know, I am writing a book on Spark (and it’s nearing completion), 
published by Manning, called Spark in Action, 2e. I am pretty sure you 
are not the target audience for this book, as it is tailored more for the data 
engineer and architect who is starting their journey with Spark, but it 
contains a lot of non-Spark info and best practices as well.

If you are interested, I am happy to send you a copy - no strings attached, you 
don’t have to promote it, you don’t have to like it, you don’t have to give me 
any feedback. If you do any of that, it’s great, but really not necessary. If 
you learn something, that’s my pleasure. It’s a basic (maybe ridiculous) way to 
say thank you for this awesome tool you are building.

If you’d like your copy, just shoot me an email at jgp [at] oplo [dot] io (or 
on twitter @jgperrin, or on LinkedIn http://linkedin.com/in/jgperrin)

Thanks for building Spark!


Jean-Georges Perrin
j...@jgp.net






Re: DataSourceV2 sync, 17 April 2019

2019-04-27 Thread Jean Georges Perrin
This may be completely inappropriate, and I apologize if it is; nevertheless, I 
am trying to get some clarification about the current status of the DataSource APIs.

Please tell me where I am wrong:

Currently, the stable API is v1.
There is a v2 DS API, but it is not widely used.
The group is working on a “new” v2 API that will be available after the release 
of Spark v3.

jg

--
Jean Georges Perrin
j...@jgp.net



> On Apr 19, 2019, at 10:10, Ryan Blue  wrote:
> 
> Here are my notes from the last DSv2 sync. As always:
> 
> If you’d like to attend the sync, send me an email and I’ll add you to the 
> invite. Everyone is welcome.
> These notes are what I wrote down and remember. If you have corrections or 
> comments, please reply.
> Topics:
> 
> TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
> Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
> Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129
> Attendees:
> 
> Ryan Blue
> John Zhuge
> Matt Cheah
> Yifei Huang
> Bruce Robbins
> Jamison Bennett
> Russell Spitzer
> Wenchen Fan
> Yuanjian Li
> 
> (and others who arrived after the start)
> 
> Discussion:
> 
> TableCatalog PR: https://github.com/apache/spark/pull/24246
> Wenchen and Matt had just reviewed the PR. Mostly what was in the SPIP so not 
> much discussion of content.
> Wenchen: Easier to review if the changes to move Table and TableCapability 
> were in a separate PR (mostly import changes)
> Ryan will open a separate PR for the move [Ed: #24410]
> Russell: How should caching work? Has hit lots of problems with Spark caching 
> data and getting out of date
> Ryan: Spark should always call into the catalog and not cache to avoid those 
> problems. However, Spark should ensure that it uses the same instance of a 
> Table for all scans in the same query, for consistent self-joins.
> Some discussion of self joins. Conclusion was that we don’t need to worry 
> about this yet because it is unlikely.
> Wenchen: should this include the namespace methods?
> Ryan: No, those are a separate concern and can be added in a parallel PR.
> Remove SaveMode PR: https://github.com/apache/spark/pull/24233
> Wenchen: PR is on hold waiting for streaming capabilities, #24129, because 
> the Noop sink doesn’t validate schema
> Wenchen will open a PR to add a capability to opt out of schema validation, 
> then come back to this PR.
> Streaming capabilities PR: https://github.com/apache/spark/pull/24129
> Ryan: This PR needs validation in the analyzer. The analyzer is where 
> validations should exist, or else validations must be copied into every code 
> path that produces a streaming plan.
> Wenchen: the write check can’t be written because the write node is never 
> passed to the analyzer. Fixing that is a larger problem.
> Ryan: Agree that refactoring to pass the write node to the analyzer should be 
> separate.
> Wenchen: a check to ensure that either microbatch or continuous can be used 
> is hard because some sources may fall back
> Ryan: By the time this check runs, fallback has happened. Do v1 sources 
> support continuous mode?
> Wenchen: No, v1 doesn’t support continuous
> Ryan: Then this can be written to assume that v1 sources only support 
> microbatch mode.
> Wenchen will add this check
> Wenchen: the check that tables in a v2 streaming relation support either 
> microbatch or continuous won’t catch anything and are unnecessary
> Ryan: These checks still need to be in the analyzer so future uses do not 
> break. We had the same problem moving to v2: because schema checks were 
> specific to DataSource code paths, they were overlooked when adding v2. 
> Running validations in the analyzer avoids problems like this.
> Wenchen will add the validation.
> Matt: Will v2 be ready in time for the 3.0 release?
> Ryan: Once #24246 is in, we can work on PRs in parallel, but it is not 
> looking good.
> -- 
> Ryan Blue
> Software Engineer
> Netflix



Re: [RESULT] [VOTE] Functional DataSourceV2 in Spark 3.0

2019-03-03 Thread Jean Georges Perrin
Hi, I am kind of new to the whole Apache process (not specifically Spark). Does 
that mean that DataSourceV2 is dead or stays experimental? Thanks for 
clarifying for a newbie. 

jg


> On Mar 3, 2019, at 11:21, Ryan Blue  wrote:
> 
> This vote fails with the following counts:
> 
> 3 +1 votes:
> 
> Matt Cheah
> Ryan Blue
> Sean Owen (binding)
> 1 -0 vote:
> 
> Jose Torres
> 2 -1 votes:
> 
> Mark Hamstra (binding)
> Mridul Muralidharan (binding)
> Thanks for the discussion, everyone. It sounds to me that the main objection 
> is simply that we’ve already committed to a release that removes deprecated 
> APIs and we don’t want to commit to features at the same time. While I’m a 
> bit disappointed, I think that’s a reasonable position for the community to 
> take and at least is a clear result.
> 
> rb
> 
>> On Thu, Feb 28, 2019 at 8:38 AM Ryan Blue rb...@netflix.com wrote:
>> 
>> I’d like to call a vote for committing to getting DataSourceV2 in a 
>> functional state for Spark 3.0.
>> 
>> For more context, please see the discussion thread, but here is a quick 
>> summary about what this commitment means:
>> 
>> We think that a “functional DSv2” is an achievable goal for the Spark 3.0 
>> release
>> We will consider this a blocker for Spark 3.0, and take reasonable steps to 
>> make it happen
>> We will not delay the release without a community discussion
>> Here’s what we’ve defined as a functional DSv2:
>> 
>> Add a plugin system for catalogs
>> Add an interface for table catalogs (see the ongoing SPIP vote)
>> Add an implementation of the new interface that calls SessionCatalog to load 
>> v2 tables
>> Add a resolution rule to load v2 tables from the v2 catalog
>> Add CTAS logical and physical plan nodes
>> Add conversions from SQL parsed plans to v2 logical plans (e.g., INSERT INTO 
>> support)
>> Please vote in the next 3 days on whether you agree with committing to this 
>> goal.
>> 
>> [ ] +1: Agree that we should consider a functional DSv2 implementation a 
>> blocker for Spark 3.0
>> [ ] +0: . . .
>> [ ] -1: I disagree with this goal because . . .
>> 
>> Thank you!
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: Static functions

2019-02-15 Thread Jean Georges Perrin
Hey Jacek,

You mean:
 * @groupname udf_funcs UDF functions
 * @groupname agg_funcs Aggregate functions
 * @groupname datetime_funcs Date time functions
 * @groupname sort_funcs Sorting functions
 * @groupname normal_funcs Non-aggregate functions
 * @groupname math_funcs Math functions
 * @groupname misc_funcs Misc functions
 * @groupname window_funcs Window functions
 * @groupname string_funcs String functions
 * @groupname collection_funcs Collection functions
 * @groupname Ungrouped Support functions for DataFrames

There’s that (and thanks, I did not know), but it does not show on the Javadoc 
in 
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html…
 or anywhere else (Scala doc or SQL functions).


Jean Georges Perrin
j...@jgp.net



> On Feb 11, 2019, at 09:42, Jacek Laskowski  wrote:
> 
> Hi Jean,
> 
> I thought the functions have already been tagged?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
> 
> On Sun, Feb 10, 2019 at 11:48 PM Jean Georges Perrin wrote:
> Hey guys,
> 
> We have 381 static functions now (including the deprecated). I am trying to 
> sort them out by group/tag them.
> 
> So far, I have:
> Array
> Conversion
> Date
> Math
> Trigo (sub group of maths)
> Security
> Streaming
> String
> Technical
> Do you see more categories? Tags?
> 
> Thanks!
> 
> jg
> 
> —
> Jean Georges Perrin / @jgperrin
> 



Static functions

2019-02-10 Thread Jean Georges Perrin
Hey guys,

We have 381 static functions now (including the deprecated). I am trying to 
sort them out by group/tag them.

So far, I have:
Array
Conversion
Date
Math
Trigo (sub group of maths)
Security
Streaming
String
Technical
Do you see more categories? Tags?

Thanks!

jg

—
Jean Georges Perrin / @jgperrin



Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Jean Georges Perrin
Awesome - thanks Dongjoon!

> On Oct 10, 2018, at 2:36 PM, Dongjoon Hyun  wrote:
> 
> For now, you can see the generated release notes. The official ones will be posted on 
> the website when the official 2.4.0 is out.
> 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12342385
> 
> Bests,
> Dongjoon.
> 
> 
> On Wed, Oct 10, 2018 at 11:29 AM Jean Georges Perrin wrote:
> Hi,
> 
> Sorry if it's a stupid question, but where can I find the release notes of 
> 2.4.0?
> 
> jg
> 
>> On Oct 10, 2018, at 2:00 PM, Imran Rashid wrote:
>> 
>> Sorry I had messed up my testing earlier, so I only just discovered 
>> https://issues.apache.org/jira/browse/SPARK-25704
>> 
>> I don't think this is a release blocker, because it's not a regression and 
>> there is a workaround, just fyi.
>> 
>> On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 2.4.0.
>> 
>> The vote is open until October 1 PST and passes if a majority +1 PMC votes 
>> are cast, with
>> a minimum of 3 +1 votes.
>> 
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>> The tag to be voted on is v2.4.0-rc3 (commit 
>> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>> https://github.com/apache/spark/tree/v2.4.0-rc3
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>> 
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1289
>> 
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>> 
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>> 
>> FAQ
>> 
>> =
>> How can I help test this release?
>> =
>> 
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>> 
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>> 
>> ===
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===
>> 
>> The current list of open tickets targeted at 2.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 2.4.0
>> 
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>> 
>> ==
>> But my bug isn't fixed?
>> ==
>> 
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
> 



Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-10 Thread Jean Georges Perrin
Hi,

Sorry if it's a stupid question, but where can I find the release notes of 2.4.0?

jg

> On Oct 10, 2018, at 2:00 PM, Imran Rashid wrote:
> 
> Sorry I had messed up my testing earlier, so I only just discovered 
> https://issues.apache.org/jira/browse/SPARK-25704 
> 
> 
> I don't think this is a release blocker, because it's not a regression and 
> there is a workaround, just fyi.
> 
> On Wed, Oct 10, 2018 at 11:47 AM Wenchen Fan wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.0.
> 
> The vote is open until October 1 PST and passes if a majority +1 PMC votes 
> are cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/ 
> 
> 
> The tag to be voted on is v2.4.0-rc3 (commit 
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> https://github.com/apache/spark/tree/v2.4.0-rc3 
> 
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/ 
> 
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS 
> 
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1289 
> 
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/ 
> 
> 
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385 
> 
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
> 
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
> 
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK 
>  and search for "Target 
> Version/s" = 2.4.0
> 
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
> 
> ==
> But my bug isn't fixed?
> ==
> 
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.