Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> think. I'm still not convinced there is a burning need to use Java 11 but stay on 2.4, after 3.0 is out, and at least the wheels are in motion there. Java 8 is still free and being updated. > On Fri, Sep 20, 2019 at 12:48 PM Ryan Blue wrote: > H

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue wrote: >> Hi everyone, >> In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added. >> A Spark 2.5

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Ryan Blue
> Here, I am proposing to cut the branch on October 15th. If the features are targeting the 3.0 preview release, please prioritize the work and finish it before that date. Note, Oct. 15th is not the code freeze of Spark 3.0. That means the community will still work on features for the upcoming Spark 3.0 release, even if they are not included in the preview release. The goal of the preview release is to collect more feedback from the community regarding the new 3.0 features/behavior changes. > Thanks! -- Ryan Blue Software Engineer Netflix

[DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
, to keep the scope of the release small. The purpose is to assist people moving to 3.0 and not distract from the 3.0 release. Would a Spark 2.5 release help anyone else? Are there any concerns about this plan? rb -- Ryan Blue Software Engineer Netflix

DSv2 sync notes - 28 September 2019

2019-09-20 Thread Ryan Blue
Here are my notes from this week’s DSv2 sync. *Attendees*: Ryan Blue Holden Karau Russell Spitzer Terry Kim Wenchen Fan Shiv Prashant Sood Joseph Torres Gengliang Wang Matt Cheah Burak Yavuz *Topics*: - Driver-side Hadoop conf - SHOW DATABASES/NAMESPACES behavior - Review outstanding

Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Ryan Blue
- SPARK-24941 Add RDDBarrier.coalesce() function
- SPARK-25017 Add test suite for ContextBarrierState
- SPARK-25083 remove the type erasure hack in data source scan
- SPARK-25383 Image data source supports sample pushdown
- SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
- SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
- SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
- SPARK-26731 remove EOLed spark jobs from jenkins
- SPARK-26664 Make DecimalType's minimum adjusted scale configurable
- SPARK-21559 Remove Mesos fine-grained mode
- SPARK-24942 Improve cluster resource management with jobs containing barrier stage
- SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
- SPARK-26022 PySpark Comparison with Pandas
- SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
- SPARK-26221 Improve Spark SQL instrumentation and metrics
- SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
- SPARK-25843 Redesign rangeBetween API
- SPARK-25841 Redesign window function rangeBetween API
- SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
- SPARK-23210 Introduce the concept of default value to schema
- SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
- SPARK-25531 new write APIs for data source v2
- SPARK-25547 Pluggable jdbc connection factory
- SPARK-20845 Support specification of column names in INSERT INTO
- SPARK-24417 Build and Run Spark on JDK11
- SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
- SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
- SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
- SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
- SPARK-25186 Stabilize Data Source V2 API
- SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
- SPARK-25390 data source V2 API refactoring
- SPARK-7768 Make user-defined type (UDT) API public
- SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
- SPARK-15691 Refactor and improve Hive support
- SPARK-15694 Implement ScriptTransformation in sql/core
- SPARK-16217 Support SELECT INTO statement
- SPARK-16452 basic INFORMATION_SCHEMA support
- SPARK-18134 SQL: MapType in Group BY and Joins not working
- SPARK-18245 Improving support for bucketed table
- SPARK-19842 Informational Referential Integrity Constraints Support in Spark
- SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
- SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
- SPARK-22386 Data Source V2 improvements
- SPARK-24723 Discuss necessary info and access in barrier mode + YARN

-- Name: Jungtaek Lim, Blog: http://medium.com/@heartsavior, Twitter: http://twitter.com/heartsavior, LinkedIn: http://www.linkedin.com/in/heartsavior
-- John Zhuge
-- Twitter: https://twitter.com/holdenkarau, Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9, YouTube Live Streams: https://www.youtube.com/user/holdenkarau
-- Ryan Blue Software Engineer Netflix

Re: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-11 Thread Ryan Blue
+0 > [ ] -1: I don't think this is a good idea because ... > I'll start with my +1 > Thanks, > Tom -- Ryan Blue Software Engineer Netflix

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-06 Thread Ryan Blue
am DBMS is using this policy by default. >> Currently, the V1 data source uses the "Legacy" policy by default, while V2 uses "Strict". This proposal is to use the "ANSI" policy by default for both V1 and V2 in Spark 3.0. >> There was also a DISCUSS thread, "Follow ANSI SQL on table insertion", on the dev mailing list. >> This vote is open until next Thursday (Sept. 12). >> [ ] +1: Accept the proposal >> [ ] +0 >> [ ] -1: I don't think this is a good idea because ... >> Thank you! >> Gengliang -- Ryan Blue Software Engineer Netflix
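The vote above concerns the store assignment policy for table inserts. As a rough analogy in plain Java (not Spark's actual code path; the class name is invented for illustration): the "Legacy" policy coerces values silently, like a raw narrowing cast, while ANSI-style checking rejects an invalid coercion at runtime.

```java
public class StoreAssignmentAnalogy {
    public static void main(String[] args) {
        long value = 3_000_000_000L; // does not fit in an int column

        // "Legacy"-style behavior: coerce silently, which wraps around.
        int legacy = (int) value;
        System.out.println(legacy); // -1294967296

        // ANSI-style behavior: fail loudly on an invalid runtime coercion.
        try {
            System.out.println(Math.toIntExact(value));
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The "Strict" policy mentioned in the vote goes further, disallowing possibly-lossy assignments up front rather than checking values at runtime.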

DSv2 sync - 4 September 2019

2019-09-06 Thread Ryan Blue
Here are my notes from the latest sync. Feel free to reply with clarifications if I’ve missed anything. *Attendees*: Ryan Blue John Zhuge Russell Spitzer Matt Cheah Gengliang Wang Priyanka Gomatam Holden Karau *Topics*: - DataFrameWriterV2 insert vs append (recap) - ANSI and strict modes

DSv2 sync notes - 21 August 2019

2019-08-30 Thread Ryan Blue
Sorry these notes were delayed. Here’s what we talked about in the last DSv2 sync. *Attendees*: Ryan Blue John Zhuge Burak Yavuz Gengliang Wang Terry Kim Wenchen Fan Xin Ren Srabasti Banerjee Priyanka Gomatam *Topics*: - Follow up on renaming append to insert in v2 API - Changes

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread Ryan Blue
ime - softwareVersion - options (map) > ViewColumn interface: - name - type > Thanks, John Zhuge -- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes - 24 July 2019

2019-08-06 Thread Ryan Blue
Here are my notes from the last DSv2 sync. Sorry it's a bit late! *Attendees*: Ryan Blue John Zhuge Raymond McCollum Terry Kim Gengliang Wang Jose Torres Wenchen Fan Priyanka Gomatam Matt Cheah Russell Spitzer Burak Yavuz *Topics*: - Check in on blockers - Remove SaveMode

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Ryan Blue
. My intuition is yes, because >> different users have different levels of tolerance for different kinds of >> errors. I’d expect these sorts of configurations to be set up at an >> infrastructure level, e.g. to maintain consistent standards throughout a >> whole organiz

Re: DataSourceV2 : Transactional Write support

2019-08-03 Thread Ryan Blue
in advance for your help. >> >> Regards, >> Shiv >> > > > -- > Name : Jungtaek Lim > Blog : http://medium.com/@heartsavior > Twitter : http://twitter.com/heartsavior > LinkedIn : http://www.linkedin.com/in/heartsavior > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Ryan Blue
-Matt Cheah >> From: Reynold Xin >> Date: Wednesday, July 31, 2019 at 9:58 AM >> To: Matt Cheah >> Cc: Russell Spitzer, Takeshi Yamamuro <linguin@gmail.com>, Gengliang Wang, Ryan Blue,

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-26 Thread Ryan Blue
SQL is a better idea. >> For more information, please read the Discuss: Follow ANSI SQL on table >> insertion >> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing> >> Please let me know if you have any thoughts on this. >> >> Regards, >> Gengliang >> > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 sync notes - 10 July 2019

2019-07-23 Thread Ryan Blue
rt data source v2 performance a lot and we'd better > fix it sooner rather than later. > > > On Sat, Jul 20, 2019 at 8:20 AM Ryan Blue > wrote: > >> Here are my notes from the last sync. If you’d like to be added to the >> invite or have topics, please let me know. >&g

DataSourceV2 sync notes - 10 July 2019

2019-07-19 Thread Ryan Blue
Here are my notes from the last sync. If you’d like to be added to the invite or have topics, please let me know. *Attendees*: Ryan Blue Matt Cheah Yifei Huang Jose Torres Burak Yavuz Gengliang Wang Michael Artz Russel Spitzer *Topics*: - Existing PRs - V2 session catalog: https

Re: JDBC connector for DataSourceV2

2019-07-12 Thread Ryan Blue
Sounds great! Ping me on the review, I think this will be really valuable. On Fri, Jul 12, 2019 at 6:51 PM Xianyin Xin wrote: > If there’s nobody working on that, I’d like to contribute. > > > > Loop in @Gengliang Wang. > > > > Xianyin > > > > *F

Re: JDBC connector for DataSourceV2

2019-07-12 Thread Ryan Blue
Master, > but can't find a JDBC implementation or related JIRA. > > DatasourceV2 APIs to me look in good shape to attempt a JDBC connector for > READ/WRITE path. > > Thanks & Regards, > Shiv > -- Ryan Blue Software Engineer Netflix

DSv2 sync notes - 26 June 2019

2019-06-28 Thread Ryan Blue
Here are my notes from this week’s sync. *Attendees*: Ryan Blue John Zhuge Dale Richardson Gabor Somogyi Matt Cheah Yifei Huang Xin Ren Jose Torres Gengliang Wang Kevin Yu *Topics*: - Metadata columns or function push-down for Kafka v2 source - Open PRs - REPLACE TABLE

Re: Timeline for Spark 3.0

2019-06-28 Thread Ryan Blue
in terms of the first RC and final > release? > > > > > > > > Cheers Andrew > > ----- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: DSv1 removal

2019-06-20 Thread Ryan Blue
(in case of DSv2 replacement is implemented). After some > digging I've found DSv1 sources which are already removed but in some cases > v1 and v2 still exists in parallel. > > Can somebody please tell me what's the overall plan in this area? > > BR, > G > > -- Ryan Blue Software Engineer Netflix

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-17 Thread Ryan Blue
> I would like to call a vote for the SPIP for SPARK-25299 <https://issues.apache.org/jira/browse/SPARK-25299>, which proposes to introduce a pluggable storage API for temporary shuffle data. > You may find the SPIP document here <https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit>. > The discussion thread for the SPIP was conducted here <https://lists.apache.org/thread.html/2fe82b6b86daadb1d2edaef66a2d1c4dd2f45449656098ee38c50079@%3Cdev.spark.apache.org%3E>. > Please vote on whether or not this proposal is agreeable to you. > Thanks! > -Matt Cheah -- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes - 12 June 2019

2019-06-14 Thread Ryan Blue
Here are the latest DSv2 sync notes. Please reply with updates or corrections. *Attendees*: Ryan Blue Michael Armbrust Gengliang Wang Matt Cheah John Zhuge *Topics*: Wenchen’s reorganization proposal Problems with TableProvider - property map isn’t sufficient New PRs: - ReplaceTable

DataSourceV2 sync notes - 29 May 2019

2019-05-30 Thread Ryan Blue
Here are my notes from last night’s sync. I had to leave early, so there may be more discussion. Others can fill in the details for those topics. *Attendees*: John Zhuge Ryan Blue Yifei Huang Matt Cheah Yuanjian Li Russell Spitzer Kevin Yu *Topics*: - Atomic extensions for the TableCatalog

DataSourceV2 sync notes - 15 May 2019

2019-05-29 Thread Ryan Blue
Sorry these notes are so late, I didn’t get to the write up until now. As usual, if anyone has corrections or comments, please reply. *Attendees*: John Zhuge Ryan Blue Andrew Long Wenchen Fan Gengliang Wang Russell Spitzer Yuanjian Li Yifei Huang Matt Cheah Amardeep Singh Dhilon Zhilmil Dhion

Re: DataSourceV2Reader Q

2019-05-21 Thread Ryan Blue
.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:89)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:41)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:541)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:763)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:463)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:209)]
-- Ryan Blue Software Engineer Netflix

DataSourceV2 community sync notes - 1 May 2019

2019-05-06 Thread Ryan Blue
Here are my notes for the latest DSv2 community sync. As usual, if you have comments or corrections, please reply. If you’d like to be invited to the next sync, email me directly. Everyone is welcome to attend. *Attendees*: Ryan Blue John Zhuge Andrew Long Bruce Robbins Dilip Biswal Gengliang

Re: Bucketing and catalyst

2019-05-02 Thread Ryan Blue
Catalyst? I’ve been trying to piece together how > Catalyst knows that it can remove a sort and shuffle given that both tables > are bucketed and sorted the same way. Is there any classes in particular I > should look at? > > > > Cheers Andrew > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 sync, 17 April 2019

2019-04-29 Thread Ryan Blue
orges Perrin > j...@jgp.net > > > > On Apr 19, 2019, at 10:10, Ryan Blue wrote: > > Here are my notes from the last DSv2 sync. As always: > >- If you’d like to attend the sync, send me an email and I’ll add you >to the invite. Everyone is welcome. >

DataSourceV2 sync, 17 April 2019

2019-04-19 Thread Ryan Blue
*: - TableCatalog PR #24246: https://github.com/apache/spark/pull/24246 - Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233 - Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129 *Attendees*: Ryan Blue John Zhuge Matt Cheah Yifei Huang Bruce Robbins Jamison

Re: Spark 2.4.2

2019-04-16 Thread Ryan Blue
r. Do you have a different proposal about how this should > be handled? > > On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue wrote: > >> Is this a bug fix? It looks like a new feature to me. >> >> On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust >> wrote: >> &

Re: Spark 2.4.2

2019-04-16 Thread Ryan Blue
'd like to start the process shortly. > > Michael > -- Ryan Blue Software Engineer Netflix

Re: Thoughts on dataframe cogroup?

2019-04-15 Thread Ryan Blue
>>> Hello, I fail to see how an equi-join on the key columns is different than the cogroup you propose. >>> I

Re: Dataset schema incompatibility bug when reading column partitioned data

2019-04-11 Thread Ryan Blue
gt;> That is, when reading column partitioned Parquet files the explicitly >> specified schema is not adhered to, instead the partitioning columns are >> appended the end of the column list. This is a quite severe issue as some >> operations, such as union, fails if columns are in

DataSourceV2 sync 3 April 2019

2019-04-04 Thread Ryan Blue
Ryan Blue John Zhuge Russel Spitzer Gengliang Wang Yuanjian Li Matt Cheah Yifei Huang Felix Cheung Dilip Biswal Wenchen Fan -- Ryan Blue Software Engineer Netflix

Re: Closing a SparkSession stops the SparkContext

2019-04-03 Thread Ryan Blue
he SparkContext but is state only needed by one > > SparkSession and that there isn't any way to clean up now, that's a > > compelling reason to change the API. Is that the situation? The only > > downside is making the user separately stop the SparkContext then. > >

Re: Closing a SparkSession stops the SparkContext

2019-04-02 Thread Ryan Blue
trying to understand why this is the intended behavior – anyone > have any knowledge of why this is the case? > > > > > > > > Thanks, > > > > Vinoo > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: Manually reading parquet files.

2019-03-21 Thread Ryan Blue
doopConfWithOptions(relation.options))
)

import scala.collection.JavaConverters._

val rows = readFile(pFile).flatMap(_ match {
  case r: InternalRow => Seq(r)
  // This doesn't work: vectorized mode is doing something screwy
  case b: ColumnarBatch => b.rowIterator().asScala
}).toList

println(rows)
// List([0,1,5b,24,66647361])
// this is wrong, I think

Has anyone attempted something similar?

Cheers, Andrew -- Ryan Blue Software Engineer Netflix

Re: Hive Hash in Spark

2019-03-06 Thread Ryan Blue
artitioned using Hive Hash? By >> understanding, I mean that I’m able to avoid a full shuffle join on Table A >> (partitioned by Hive Hash) when joining with a Table B that I can shuffle >> via Hive Hash to Table A. >> >> >> >> Thank you, >> >> Tyson >> > > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 sync notes - 20 Feb 2019

2019-03-05 Thread Ryan Blue
e to join? > > Stavros > > On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue > wrote: > >> Here are my notes from the DSv2 sync last night. As always, if you have >> corrections, please reply with them. And if you’d like to be included on >> the invite to participate

[RESULT] [VOTE] Functional DataSourceV2 in Spark 3.0

2019-03-03 Thread Ryan Blue
This vote fails with the following counts: 3 +1 votes: - Matt Cheah - Ryan Blue - Sean Owen (binding) 1 -0 vote: - Jose Torres 2 -1 votes: - Mark Hamstra (binding) - Mridul Muralidharan (binding) Thanks for the discussion, everyone. It sounds to me that the main objection

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Ryan Blue
Actually, I went ahead and removed the confusing section. There is no public API in the doc now, so that it is clear that it isn't a relevant part of this vote. On Fri, Mar 1, 2019 at 4:58 PM Ryan Blue wrote: > I moved the public API to the "Implementation Sketch" sect

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Ryan Blue
t; anthony.young-gar...@cloudera.com.invalid> wrote: > >> +1 (non-binding) >> >> On Thu, Feb 28, 2019 at 5:54 PM John Zhuge wrote: >> >> +1 (non-binding) >> >> On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah wrote: >> >> +1 (non-binding) >> &

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
Sv2 at the end to meet a deadline? all the better to, > if anything, agree it's important now. It's also an agreement to delay > the release for it, not rush it. I don't see that later is a better > time to make the decision, if rush is a worry? > > Given my definition, and understanding of t

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
roject management, IMO. > > On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue wrote: > >> Mark, if this goal is adopted, "we" is the Apache Spark community. >> >> On Thu, Feb 28, 2019 at 9:52 AM Mark Hamstra >> wrote: >> >>> Who is "we&qu

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
tt Cheah wrote: > >> +1 (non-binding) >> >> >> >> Are identifiers and namespaces going to be rolled under one of those six >> points? >> >> >> >> *From: *Ryan Blue >> *Reply-To: *"rb...@netflix.com" >> *Date: *Thursd

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-28 Thread Ryan Blue
u, Feb 28, 2019 at 1:00 AM Ryan Blue wrote: > >> I think that's a good plan. Let's get the functionality done, but mark it >> experimental pending a new row API. >> >> So is there agreement on this set of work, then? >> >> On Tue, Feb 26, 2019 at 6:30 PM Matei

[VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
(e.g., INSERT INTO support) Please vote in the next 3 days on whether you agree with committing to this goal. [ ] +1: Agree that we should consider a functional DSv2 implementation a blocker for Spark 3.0 [ ] +0: . . . [ ] -1: I disagree with this goal because . . . Thank you! -- Ryan Blue

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread Ryan Blue
+1 (non-binding) On Wed, Feb 27, 2019 at 8:34 PM Russell Spitzer wrote: > +1 (non-binding) > > On Wed, Feb 27, 2019, 6:28 PM Ryan Blue wrote: > >> Hi everyone, >> >> In the last DSv2 sync, the consensus was that the table metadata SPIP was >> ready to bri

[VOTE] SPIP: Spark API for Table Metadata

2019-02-27 Thread Ryan Blue
doc <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d> . Please vote in the next 3 days. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ... Thanks! -- Ryan Blue Software En

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-27 Thread Ryan Blue
x that before we declare dev2 is stable, because > InternalRow is not a stable API. We don’t necessarily need to do it in 3.0. > > > > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote: > > Will that then require an API break down the line? Do we save that for > Spark 4? &

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
26, 2019 at 4:41 PM Matt Cheah wrote: > Reynold made a note earlier about a proper Row API that isn't InternalRow. Is that still on the table? > -Matt Cheah > From: Ryan Blue > Reply-To: "rb...@netflix.com" > Date: Tuesday, F

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
t I have a > problem with, are declarative, pseudo-authoritative statements that 3.0 (or > some other release) will or won't contain some feature, API, etc. or that > some issue is or is not blocker or worth delaying for. When the PMC has not > voted on such issues, I'm often left thinking, "Wait... what? Who decided > that, or where did that decision come from?" > > -- Ryan Blue Software Engineer Netflix

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-26 Thread Ryan Blue
wrote: >> +1 >> -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >> -- Takeshi Yamamuro -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Ryan Blue
id not target new features in the Spark 2.0.0 release. >>> The fact that some entity other than the PMC thinks that Spark 3.0 should >>> contain certain new features or that it will be costly to them if 3.0 does >>> not contain those features is not dispositive. If there are public API >>> changes that should occur in a timely fashion and there is also a list of >>> new features that some users or contributors want to see in 3.0 but that >>> look likely to not be ready in a timely fashion, then the PMC should fully >>> consider releasing 3.0 without all those new features. There is no reason >>> that they can't come in with 3.1.0. >>> >> -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
also the features that have remained open for the longest time and we really need to move forward on these. Putting a target release for 3.0 will help in that regard. > -Matt Cheah > From: Ryan Blue > Reply-To: "rb...@netflix.com" > D

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
an around when the release branch will likely be cut. > > Matei > > > On Feb 21, 2019, at 1:03 PM, Ryan Blue > wrote: > > > > Hi everyone, > > > > In the DSv2 sync last night, we had a discussion about roadmap and what > the goal should be for getting th

[DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
years to get the work done. Are there any objections to targeting 3.0 for this? In addition, much of the planning for multi-catalog support has been done to make v2 possible. Do we also want to include multi-catalog support? rb -- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes - 20 Feb 2019

2019-02-21 Thread Ryan Blue
contains sort information, but it isn't used because it applies only to single files. - *Consensus formed not to include sorts in v2 table metadata.* *Attendees*: Ryan Blue John Zhuge Dongjoon Hyun Felix Cheung Gengliang Wang Hyukjin Kwon Jacky Lee Jamison Bennett Matt Cheah Yifei Huang Russel

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-19 Thread Ryan Blue
:33 AM Maryann Xue > wrote: > >> +1 >> >> On Mon, Feb 18, 2019 at 10:46 PM John Zhuge wrote: >> >>> +1 >>> >>> On Mon, Feb 18, 2019 at 8:43 PM Dongjoon Hyun >>> wrote: >>> >>>> +1 >>>>

[VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Ryan Blue
in the next 3 days. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ... Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Ryan Blue
Sure. I'll start a thread. On Mon, Feb 18, 2019 at 6:27 PM Wenchen Fan wrote: > I think this is the right direction to go. Shall we move forward with a > vote and detailed designs? > > On Mon, Feb 4, 2019 at 9:57 AM Ryan Blue wrote: > >> Hi everyone, >

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
I can understand retaining > old behavior under a flag where the behavior change could be > problematic for some users or facilitate migration, but this is just a > change to some UI links no? the underlying links don't change. > On Fri, Feb 8, 2019 at 5:41 PM Ryan Blue wrote: >

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
f flag option to just get one URL or default two stdout/stderr URLs. > 3. We could let users enumerate file names they want to link, and create log links for each file. > Which one do you suggest? > On Sat, Feb 9, 2019 at 8:24 AM, Ryan Blue wrote: >> Jungtaek,

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
og overview page would make much more sense for us. We work around it with a custom submit process that logs all important URLs on the submit-side log. > On Sat, Feb 9, 2019 at 5:42 AM, Ryan Blue wrote: >> Here's what I see from a run

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
o remove the file part manually from the URL to access the list page. Instead, we may be able to change the default URL to show all local logs and let users choose which file to read (though it would be two clicks to reach the actual file). > -Jungtaek Lim (HeartSaVioR) > 2

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-07 Thread Ryan Blue
would put more files then only > stdout and stderr (like gc logs). > > SPARK-23155 provides the way to modify log URL but it's only applied to > SHS, and in Spark UI in running apps it still only shows "stdout" and > "stderr". SPARK-26792 is for applying this to Spark UI as well, but I've > got suggestion to just change the default log URL. > > Thanks again, > Jungtaek Lim (HeartSaVioR) > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 producing wrong date value in Custom Data Writer

2019-02-05 Thread Ryan Blue
get(0, > DataTypes.DateType)); > > } > > It prints an integer as output: > > MyDataWriter.write: 17039 > > > Is this a bug? or I am doing something wrong? > > Thanks, > Shubham > -- Ryan Blue Software Engineer Netflix
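The integer is expected rather than a bug: Spark's InternalRow represents DateType values as an int holding the number of days since the Unix epoch, so a DSv2 writer sees the internal encoding and must decode it. A minimal decoding sketch in plain java.time (the class name is invented for illustration):

```java
import java.time.LocalDate;

public class EpochDays {
    public static void main(String[] args) {
        // DateType in InternalRow: days since 1970-01-01.
        int raw = 17039; // the value the writer observed
        LocalDate date = LocalDate.ofEpochDay(raw);
        System.out.println(date); // 2016-08-26
    }
}
```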

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
, and partitioning is already supported. The idea to use conditions to create separate data frames would actually make that harder because you'd need to create and name tables for each one. On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo wrote: > Hello Ryan, > > On Mon, Feb 4, 2019 at 10:52 AM
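The split-by-condition request boils down to partitioning a collection by a predicate, which two filter calls already express. A local-collections analogy of what a single-pass split computes (illustrative plain Java, not a Spark API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SplitByCondition {
    public static void main(String[] args) {
        // One pass over the data, two output groups keyed by the predicate.
        Map<Boolean, List<Integer>> parts = Stream.of(1, 2, 3, 4, 5)
                .collect(Collectors.partitioningBy(n -> n % 2 == 0));
        System.out.println(parts.get(true));  // [2, 4]
        System.out.println(parts.get(false)); // [1, 3, 5]
    }
}
```

In Spark the equivalent is usually `df.filter(cond)` and `df.filter(!cond)`, or partitioning the output at write time, as the reply above argues.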

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
t; > -- > > > > Moein Hosseini > > Data Engineer > > mobile: +98 912 468 1859 > > site: www.moein.xyz > > email: moein...@gmail.com > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

[DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-03 Thread Ryan Blue
discussion. From the feedback in the DSv2 sync and on the previous thread, I think it should go quickly. Thanks for taking a look at the proposal, rb -- Ryan Blue

Re: Purpose of broadcast timeout

2019-01-30 Thread Ryan Blue
e broadcast timeout really meant to be a timeout on > sparkContext.broadcast, instead of the child.executeCollectIterator()? In > that case, would it make sense to move the timeout to wrap only > sparkContext.broadcast? > > Best, > > Justin > -- Ryan Blue Software Engineer Netflix
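The question above is about narrowing a timeout's scope to one call instead of the surrounding work. That shape can be sketched with a Future in plain Java (not Spark's implementation; all names are invented for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutWrap {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Only this submitted step is bounded by the timeout; work done
        // before or after it is not, which is the scoping being proposed.
        Future<String> step = pool.submit(() -> "broadcasted");
        try {
            System.out.println(step.get(5, TimeUnit.SECONDS)); // broadcasted
        } catch (TimeoutException e) {
            step.cancel(true);
            System.out.println("timed out");
        } finally {
            pool.shutdown();
        }
    }
}
```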

DataSourceV2 sync notes

2019-01-28 Thread Ryan Blue
- Ryan: next time, we should talk about the set of metadata proposed for TableCatalog, but we’re out of time. *Attendees*: Ryan Blue John Zhuge Reynold Xin Xiao Li Dongjoon Hyun Eric Wohlstadter Hyukjin Kwon Jacky Lee Jamison Bennett Kevin Yu Yuanjian Li Maryann Xue Matt Cheah Dale Richards

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Ryan Blue
scheme will need to play nice with column identifiers as well. > From: Ryan Blue > Sent: Thursday, January 17, 2019 11:38 AM > To: Spark Dev List > Subject: Re: [DISCUSS] Identifiers with multi-catalog support

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-17 Thread Ryan Blue
Any discussion on how Spark should manage identifiers when multiple catalogs are supported? I know this is an area where a lot of people are interested in making progress, and it is a blocker for both multi-catalog support and CTAS in DSv2. On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue wrote: >

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Ryan Blue
a long time (say .. until > Spark 4.0.0?). > > > > I know somehow it happened to be sensitive but to be just literally > honest to myself, I think we should make a try. > > > > > -- > Marcelo > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
we are super 100% dependent on Hive... >> >> >> -- >> *From:* Ryan Blue >> *Sent:* Tuesday, January 15, 2019 9:53 AM >> *To:* Xiao Li >> *Cc:* Yuming Wang; dev >> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4 >

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
st PR <https://github.com/apache/spark/pull/23552> does not >> contain the changes of hive-thriftserver. Please ignore the failed test in >> hive-thriftserver. >> >> The second PR <https://github.com/apache/spark/pull/23553> is complete >> changes. >> >> >> >> I have created a Spark distribution for Apache Hadoop 2.7, you might >> download it via Google Drive >> <https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or Baidu >> Pan <https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>. >> >> Please help review and test. Thanks. >> > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Ryan Blue
to support path-based tables by adding a path to CatalogIdentifier, either as a namespace or as a separate optional string. Then, the identifier passed to a catalog would work for either a path-based table or a catalog table, without needing a path-based catalog API. Thoughts? On Sun, Jan 13, 2019 at 1:

[DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Ryan Blue
are tables and not nested namespaces. How would Spark handle arbitrary nesting that differs across catalogs? Hopefully, I’ve captured the design question well enough for a productive discussion. Thanks! rb -- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes

2019-01-10 Thread Ryan Blue
Here are my notes from the DSv2 sync last night. *As usual, I didn’t take great notes because I was participating in the discussion. Feel free to send corrections or clarification.* *Attendees*: Ryan Blue John Zhuge Xiao Li Reynold Xin Felix Cheung Anton Okolnychyi Bruce Robbins Dale Richardson

DataSourceV2 community sync tonight

2019-01-09 Thread Ryan Blue
an also talk about the user-facing API <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.cgnrs9vys06x> proposed in the SPIP. Thanks, rb -- Ryan Blue Software Engineer Netflix

Re: Trigger full GC during executor idle time?

2018-12-31 Thread Ryan Blue
the tune of 2-6%. Has anyone >> considered this before? >> >> Sean >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Ryan Blue
>>> > Following this direction, it makes more sense to delegate everything >>> to data sources. >>> > >>> > As the first step, maybe we should not add DDL commands to change >>> schema of data source, but just use the capability API to let data s

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Ryan Blue
writing. > Users can use native client of data source to change schema. > > > > On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue wrote: > >> > >> I think it is good to know that not all sources support default values. > That makes me think that we should delegat

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Ryan Blue
l that we should follow RDBMS/SQL standard >> regarding the behavior? >> >> > pass the default through to the underlying data source >> >> This is one way to implement the behavior. >> >> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue wrote: >> >>>

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
sers. > > On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue wrote: > >> Wenchen, can you give more detail about the different ADD COLUMN syntax? >> That sounds confusing to end users to me. >> >> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan wrote: >> >>> N

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
ng data, fill the missing column with the initial default >>> value >>> 4. when writing data, fill the missing column with the latest default >>> value >>> 5. when altering a column to change its default value, only update the >>> latest defa
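The numbered behaviors quoted above can be made concrete with a small example. The DDL below is hypothetical — default-value support for DSv2 was still under discussion in this thread — and the table and column names are invented:

```scala
// Add a column with a default; existing rows store no value for it.
spark.sql("ALTER TABLE db.events ADD COLUMN status STRING DEFAULT 'new'")

// Per the proposed behavior: reads fill the missing column with the
// default captured when the column was added (the "initial" default).
spark.sql("ALTER TABLE db.events ALTER COLUMN status SET DEFAULT 'open'")
// Rows written before this change still read back as 'new'; writes that
// omit the column from now on store the "latest" default, 'open'.
```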

[DISCUSS] Default values and data sources

2018-12-18 Thread Ryan Blue
that this complexity probably isn’t worth consistency in default values across sources, if that is even achievable. In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments! rb -- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes (#4)

2018-12-18 Thread Ryan Blue
to the invite list, just let me know. Everyone is welcome. rb *Attendees*: Ryan Blue Xiao Li Bruce Robbins John Zhuge Anton Okolnychyi Jackey Lee Jamison Bennett Srabasti Banerjee Thomas D’Silva Wenchen Fan Matt Cheah Maryann Xue (possibly others that entered after the start) *Agenda

Re: [DISCUSS] Function plugins

2018-12-18 Thread Ryan Blue
*Cc: *Spark Dev List >> *Subject: *Re: [DISCUSS] Function plugins >> >> Having a way to register UDFs that are not using Hive APIs would be great! >> >> >&

[DISCUSS] Function plugins

2018-12-14 Thread Ryan Blue
ave to solve challenges with function naming (whether there is a db component). Right now I’d like to think through the overall idea and not get too focused on those details. Thanks, rb -- Ryan Blue Software Engineer Netflix
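For context, the built-in non-Hive registration path that already exists looks like the sketch below (assuming an active `SparkSession` named `spark`); the thread is about a catalog-backed alternative that would outlive the session and handle naming:

```scala
// Register a plain Scala function as a UDF -- no Hive APIs involved.
// The registration is session-scoped, which is one limitation a
// FunctionCatalog plugin could address.
spark.udf.register("plus_one", (x: Int) => x + 1)

// Prints a one-row table containing 42.
spark.sql("SELECT plus_one(41)").show()
```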

Re: Self join

2018-12-13 Thread Ryan Blue
d to address this). > I'd consider this as a brainstorming email thread. Once we have a good > proposal, then we can go ahead with a SPIP. > > Thanks, > Marco > > Il giorno mer 12 dic 2018 alle ore 19:13 Ryan Blue ha > scritto: > >> Marco, >> >>

Re: dsv2 remaining work

2018-12-13 Thread Ryan Blue
esign for each data > type. > > The above are the big one I can think of. I probably missed some, but a > lot of other smaller things can be improved on later. > > > > > > > -- Ryan Blue Software Engineer Netflix

Re: Self join

2018-12-12 Thread Ryan Blue
people that are looking at it now are the ones already familiar with the problem. rb On Wed, Dec 12, 2018 at 2:05 AM Marco Gaido wrote: > Thank you all for your answers. > > @Ryan Blue sure, let me state the problem more > clearly: imagine you have 2 dataframes with a co
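The ambiguity under discussion shows up in a minimal self join; aliasing each side is the usual workaround. The column name and data here are invented for illustration, assuming a `SparkSession` named `spark`:

```scala
import spark.implicits._

val df = spark.range(5).withColumnRenamed("id", "key")

// Ambiguous: df("key") resolves to the same attribute on both sides of
// the join, so the condition can degenerate to a trivially true predicate.
// df.join(df, df("key") === df("key"))

// Workaround: alias each side so column references are distinct.
val left  = df.as("l")
val right = df.as("r")
left.join(right, $"l.key" === $"r.key").select($"l.key")
```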

Re: Self join

2018-12-11 Thread Ryan Blue
nges in the design, we can do that. > > Thoughts on this? > > Thanks, > Marco > -- Ryan Blue Software Engineer Netflix

Re: Pushdown in DataSourceV2 question

2018-12-11 Thread Ryan Blue
> df = spark.read.json("s3://sample_bucket/people.json")
> df.printSchema()
> df.filter($"age" > 20).explain()
>
> root
>  |-- age: long (nullable = true)
>  |-- name: string (nullable = true)
>
> == Physical Plan ==
> *Project [age#47L, name#48]
> +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, Location: InMemoryFileIndex[s3://sample_bucket/people.json], PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)], ReadSchema: struct<age:bigint,name:string>
>
> # Comments
> As you can see, PushedFilters is shown even if the input data is JSON.
> Actually this pushdown is not used.
>
> I'm wondering if this has already been discussed or not.
> If not, this is a chance to have such a feature in DataSourceV2 because it would require some API-level changes.
>
> Warm regards,
> Noritaka Sekiyama

- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

-- Ryan Blue Software Engineer Netflix

Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-10 Thread Ryan Blue
eply to the > sender that you have received this communication in error and then delete > it. > -- Ryan Blue Software Engineer Netflix
