Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
ate existing data. >>> 3. when reading data, fill the missing column with the initial default >>> value >>> 4. when writing data, fill the missing column with the latest default >>> value >>> 5. when altering a column to change its default va
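A minimal sketch of the numbered semantics above, using hypothetical DDL in the SQL-standard style (Spark did not support DEFAULT clauses at the time of this thread; the table, column, and values are invented for illustration):

  // Rule 3: files written before the column existed read back the *initial* default.
  spark.sql("ALTER TABLE t ADD COLUMN score INT DEFAULT 0")    // hypothetical syntax
  spark.sql("SELECT id, score FROM t").show()                  // old rows show score = 0

  // Rule 5: the default is later changed...
  spark.sql("ALTER TABLE t ALTER COLUMN score SET DEFAULT 10") // hypothetical syntax
  // Rule 4: ...so new writes that omit the column are filled with the latest default (10),
  // while files written earlier still read back the initial default (0).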

Re: [DISCUSS] Default values and data sources

2018-12-19 Thread Ryan Blue
e is > decided by the end-users. > > On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue wrote: > >> Wenchen, can you give more detail about the different ADD COLUMN syntax? >> That sounds confusing to end users to me. >> >> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Ryan Blue
with my proposal that we should follow RDBMS/SQL standard >> regarding the behavior? >> >> > pass the default through to the underlying data source >> >> This is one way to implement the behavior. >> >> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue wrote: &g

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Ryan Blue
table schema during writing. > Users can use native client of data source to change schema. > > > > On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue wrote: > >> > >> I think it is good to know that not all sources support default values. > That makes me think that we s

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Ryan Blue
rce supports. >>> > >>> > Following this direction, it makes more sense to delegate everything >>> to data sources. >>> > >>> > As the first step, maybe we should not add DDL commands to change >>> schema of data source, but just use th

Re: Trigger full GC during executor idle time?

2018-12-31 Thread Ryan Blue
g this can speed things up to the tune of 2-6%. Has anyone >> considered this before? >> >> Sean >> >> ----- >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- Ryan Blue Software Engineer Netflix

DataSourceV2 community sync tonight

2019-01-09 Thread Ryan Blue
an also talk about the user-facing API <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.cgnrs9vys06x> proposed in the SPIP. Thanks, rb -- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes

2019-01-10 Thread Ryan Blue
Here are my notes from the DSv2 sync last night. *As usual, I didn’t take great notes because I was participating in the discussion. Feel free to send corrections or clarification.* *Attendees*: Ryan Blue John Zhuge Xiao Li Reynold Xin Felix Cheung Anton Okolnychyi Bruce Robbins Dale Richardson

[DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Ryan Blue
tables and not nested namespaces. How would Spark handle arbitrary nesting that differs across catalogs? Hopefully, I’ve captured the design question well enough for a productive discussion. Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Ryan Blue
support path-based tables by adding a path to CatalogIdentifier, either as a namespace or as a separate optional string. Then, the identifier passed to a catalog would work for either a path-based table or a catalog table, without needing a path-based catalog API. Thoughts? On Sun, Jan 13, 2019 at 1:
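Purely as an illustration of the idea in this message (the names below are invented, not the API from the SPIP), an identifier that covers both catalog tables and path-based tables might look like:

  // Hypothetical sketch: one identifier type covering both cases.
  case class CatalogIdentifier(
      namespace: Seq[String],       // e.g. Seq("prod", "db") for a catalog table
      name: String,                 // e.g. "events"
      path: Option[String] = None)  // e.g. Some("s3://bucket/warehouse/events") for a path-based table

  val catalogTable = CatalogIdentifier(Seq("prod", "db"), "events")
  val pathTable    = CatalogIdentifier(Seq.empty, "events", Some("s3://bucket/warehouse/events"))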

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
y review. >> >> The first PR <https://github.com/apache/spark/pull/23552> does not >> contain the changes of hive-thriftserver. Please ignore the failed test in >> hive-thriftserver. >> >> The second PR <https://github.com/apache/spark/pull/23553> is complete >> changes. >> >> >> >> I have created a Spark distribution for Apache Hadoop 2.7, you might >> download it via Google Drive >> <https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or Baidu >> Pan <https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>. >> >> Please help review and test. Thanks. >> > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
; >> And we are super 100% dependent on Hive... >> >> >> -- >> *From:* Ryan Blue >> *Sent:* Tuesday, January 15, 2019 9:53 AM >> *To:* Xiao Li >> *Cc:* Yuming Wang; dev >> *Subject:* Re: [DISCUSS] Upgrade built-in Hi

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Ryan Blue
s > > we're going to stay in 1.2.x for, at least, a long time (say .. until > Spark 4.0.0?). > > > > I know somehow it happened to be sensitive but to be just literally > honest to myself, I think we should make a try. > > > > > -- > Marcelo > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-17 Thread Ryan Blue
Any discussion on how Spark should manage identifiers when multiple catalogs are supported? I know this is an area where a lot of people are interested in making progress, and it is a blocker for both multi-catalog support and CTAS in DSv2. On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue wrote: >

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Ryan Blue
whole scheme will need to play nice with column identifier as > well. > > > > > -- > > *From:* Ryan Blue > *Sent:* Thursday, January 17, 2019 11:38 AM > *To:* Spark Dev List > *Subject:* Re: [DISCUSS] Identifiers with multi-catalog support &

DataSourceV2 sync notes

2019-01-28 Thread Ryan Blue
- Ryan: next time, we should talk about the set of metadata proposed for TableCatalog, but we’re out of time. *Attendees*: Ryan Blue John Zhuge Reynold Xin Xiao Li Dongjoon Hyun Eric Wohlstadter Hyukjin Kwon Jacky Lee Jamison Bennett Kevin Yu Yuanjian Li Maryann Xue Matt Cheah Dale Richards

Re: Purpose of broadcast timeout

2019-01-30 Thread Ryan Blue
timeout. Perhaps > is the broadcast timeout really meant to be a timeout on > sparkContext.broadcast, instead of the child.executeCollectIterator()? In > that case, would it make sense to move the timeout to wrap only > sparkContext.broadcast? > > Best, > > Justin > -- Ryan Blue Software Engineer Netflix

[DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-03 Thread Ryan Blue
t ongoing discussion. From the feedback in the DSv2 sync and on the previous thread, I think it should go quickly. Thanks for taking a look at the proposal, rb -- Ryan Blue

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
; > Moein > > > > -- > > > > Moein Hosseini > > Data Engineer > > mobile: +98 912 468 1859 > > site: www.moein.xyz > > email: moein...@gmail.com > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
data, and partitioning is already supported. The idea to use conditions to create separate data frames would actually make that harder because you'd need to create and name tables for each one. On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo wrote: > Hello Ryan, > > On Mon, Feb 4, 2019 at
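A small sketch of the contrast being drawn (assuming a DataFrame named df with an event_type column; names and paths are invented):

  // Partitioned write: one table, split by column value at write time.
  df.write.partitionBy("event_type").parquet("s3://bucket/events")

  // Condition-based split: each predicate produces a separate DataFrame that
  // would then need its own name and table.
  val clicks      = df.filter(df("event_type") === "click")
  val impressions = df.filter(df("event_type") === "impression")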

Re: DataSourceV2 producing wrong date value in Custom Data Writer

2019-02-05 Thread Ryan Blue
r.write: " + record.get(0, > DataTypes.DateType)); > > } > > It prints an integer as output: > > MyDataWriter.write: 17039 > > > Is this a bug? or I am doing something wrong? > > Thanks, > Shubham > -- Ryan Blue Software Engineer Netflix
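For context: Spark's InternalRow stores DateType values as an integer count of days since the Unix epoch, which is why the writer sees 17039 rather than a date object. A minimal conversion sketch, assuming record is the InternalRow from the snippet above:

  // 17039 days after 1970-01-01 is 2016-08-26.
  val days: Int = record.getInt(0)
  val date = java.time.LocalDate.ofEpochDay(days.toLong)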

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-07 Thread Ryan Blue
ecause in real case end users would put more files then only > stdout and stderr (like gc logs). > > SPARK-23155 provides the way to modify log URL but it's only applied to > SHS, and in Spark UI in running apps it still only shows "stdout" and > "stderr". SPARK-26792 is for applying this to Spark UI as well, but I've > got suggestion to just change the default log URL. > > Thanks again, > Jungtaek Lim (HeartSaVioR) > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
ers have to remove file part manually from URL to > access list page. Instead of this we may be able to change default URL to > show all of local logs and let users choose which file to read. (though it > would be two-clicks to access to actual file) > > -Jungtaek Lim (HeartSaVioR) &

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
uring debugging. So linking the YARN container log overview > > page would make much more sense for us. We work it around with a custom > > submit process that logs all important URLs on the submit side log. > > > > > > > > 2019년 2월 9일 (토) 오전 5:42, Ryan Blue 님이 작성: &

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
urning on/off flag option to just get one url or > default two stdout/stderr urls. > 3. We could let users enumerate file names they want to link, and create > log links for each file. > > Which one do you suggest? > > 2019년 2월 9일 (토) 오전 8:24, Ryan Blue 님이 작성: > >> Jungtaek

Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-08 Thread Ryan Blue
punted to the user. I can understand retaining > old behavior under a flag where the behavior change could be > problematic for some users or facilitate migration, but this is just a > change to some UI links no? the underlying links don't change. > On Fri, Feb 8, 2019 at 5:41 PM Ry

Re: [DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Ryan Blue
Sure. I'll start a thread. On Mon, Feb 18, 2019 at 6:27 PM Wenchen Fan wrote: > I think this is the right direction to go. Shall we move forward with a > vote and detailed designs? > > On Mon, Feb 4, 2019 at 9:57 AM Ryan Blue wrote: > >> Hi everyone, >&g

[VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread Ryan Blue
n the next 3 days. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ... Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-19 Thread Ryan Blue
:33 AM Maryann Xue > wrote: > >> +1 >> >> On Mon, Feb 18, 2019 at 10:46 PM John Zhuge wrote: >> >>> +1 >>> >>> On Mon, Feb 18, 2019 at 8:43 PM Dongjoon Hyun >>> wrote: >>> >>>> +1 >>>>

DataSourceV2 sync notes - 20 Feb 2019

2019-02-21 Thread Ryan Blue
contains sort information, but it isn’t used because it applies only to single files. - *Consensus formed on not including sorts in v2 table metadata.* *Attendees*: Ryan Blue John Zhuge Dongjoon Hyun Felix Cheung Gengliang Wang Hyukjin Kwon Jacky Lee Jamison Bennett Matt Cheah Yifei Huang Russel

[DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
ing 2 years to get the work done. Are there any objections to targeting 3.0 for this? In addition, much of the planning for multi-catalog support has been done to make v2 possible. Do we also want to include multi-catalog support? rb -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
> people can still plan around when the release branch will likely be cut. > > Matei > > > On Feb 21, 2019, at 1:03 PM, Ryan Blue > wrote: > > > > Hi everyone, > > > > In the DSv2 sync last night, we had a discussion about roadmap and what > the goal

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
features that have remained open for the longest time > and we really need to move forward on these. Putting a target release for > 3.0 will help in that regard. > > > > -Matt Cheah > > > > *From: *Ryan Blue > *Reply-To: *"rb...@netflix.com" > *D

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Ryan Blue
lar major release, that's fine -- in fact, we >>> quite intentionally did not target new features in the Spark 2.0.0 release. >>> The fact that some entity other than the PMC thinks that Spark 3.0 should >>> contain certain new features or that it will be costly to them if 3.0 does >>> not contain those features is not dispositive. If there are public API >>> changes that should occur in a timely fashion and there is also a list of >>> new features that some users or contributors want to see in 3.0 but that >>> look likely to not be ready in a timely fashion, then the PMC should fully >>> consider releasing 3.0 without all those new features. There is no reason >>> that they can't come in with 3.1.0. >>> >> -- Ryan Blue Software Engineer Netflix

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-26 Thread Ryan Blue
kyLee wrote: >> >>> +1 >>> >>> >>> >>> -- >>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >>> >>> ----- >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >>> >> >> -- >> --- >> Takeshi Yamamuro >> > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
a good idea. What is a problem, or is at least something that I have a > problem with, are declarative, pseudo-authoritative statements that 3.0 (or > some other release) will or won't contain some feature, API, etc. or that > some issue is or is not blocker or worth delaying for. When the PMC has not > voted on such issues, I'm often left thinking, "Wait... what? Who decided > that, or where did that decision come from?" > > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
cement. On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah wrote: > Reynold made a note earlier about a proper Row API that isn’t InternalRow > – is that still on the table? > > > > -Matt Cheah > > > > *From: *Ryan Blue > *Reply-To: *"rb...@netflix.com" >

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-27 Thread Ryan Blue
have to fix that before we declare dev2 is stable, because > InternalRow is not a stable API. We don’t necessarily need to do it in 3.0. > > > > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote: > > Will that then require an API break down the line? Do we save that for >

[VOTE] SPIP: Spark API for Table Metadata

2019-02-27 Thread Ryan Blue
posal doc <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d> . Please vote in the next 3 days. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ... Thanks! -- Ryan Blue S

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread Ryan Blue
+1 (non-binding) On Wed, Feb 27, 2019 at 8:34 PM Russell Spitzer wrote: > +1 (non-binding) > > On Wed, Feb 27, 2019, 6:28 PM Ryan Blue wrote: > >> Hi everyone, >> >> In the last DSv2 sync, the consensus was that the table metadata SPIP was >> ready to bri

[VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
(e.g., INSERT INTO support) Please vote in the next 3 days on whether you agree with committing to this goal. [ ] +1: Agree that we should consider a functional DSv2 implementation a blocker for Spark 3.0 [ ] +0: . . . [ ] -1: I disagree with this goal because . . . Thank you! -- Ryan Blue

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-28 Thread Ryan Blue
> > On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue wrote: > >> I think that's a good plan. Let's get the functionality done, but mark it >> experimental pending a new row API. >> >> So is there agreement on this set of work, then? >> >> On Tue, Feb 2

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
AM Matt Cheah wrote: > >> +1 (non-binding) >> >> >> >> Are identifiers and namespaces going to be rolled under one of those six >> points? >> >> >> >> *From: *Ryan Blue >> *Reply-To: *"rb...@netflix.com" >> *Dat

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
ases is not > proper project management, IMO. > > On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue wrote: > >> Mark, if this goal is adopted, "we" is the Apache Spark community. >> >> On Thu, Feb 28, 2019 at 9:52 AM Mark Hamstra >> wrote: >> >&g

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
few PRs in review" issue? you worry that > we might rush DSv2 at the end to meet a deadline? all the better to, > if anything, agree it's important now. It's also an agreement to delay > the release for it, not rush it. I don't see that later is a better > time to make th

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Ryan Blue
Young-Garner < > anthony.young-gar...@cloudera.com.invalid> wrote: > >> +1 (non-binding) >> >> On Thu, Feb 28, 2019 at 5:54 PM John Zhuge wrote: >> >> +1 (non-binding) >> >> On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah wrote: >> >> +1 (non

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Ryan Blue
Actually, I went ahead and removed the confusing section. There is no public API in the doc now, so that it is clear that it isn't a relevant part of this vote. On Fri, Mar 1, 2019 at 4:58 PM Ryan Blue wrote: > I moved the public API to the "Implementation Sketch" section. Th

[RESULT] [VOTE] Functional DataSourceV2 in Spark 3.0

2019-03-03 Thread Ryan Blue
This vote fails with the following counts: 3 +1 votes: - Matt Cheah - Ryan Blue - Sean Owen (binding) 1 -0 vote: - Jose Torres 2 -1 votes: - Mark Hamstra (binding) - Mridul Muralidharan (binding) Thanks for the discussion, everyone. It sounds to me that the main objection

Re: DataSourceV2 sync notes - 20 Feb 2019

2019-03-05 Thread Ryan Blue
people to join? > > Stavros > > On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue > wrote: > >> Here are my notes from the DSv2 sync last night. As always, if you have >> corrections, please reply with them. And if you’d like to be included on >> the invite to partic

Re: Hive Hash in Spark

2019-03-06 Thread Ryan Blue
partitioned using Hive Hash? By >> understanding, I mean that I’m able to avoid a full shuffle join on Table A >> (partitioned by Hive Hash) when joining with a Table B that I can shuffle >> via Hive Hash to Table A. >> >> >> >> Thank you, >> >> Tyson >> > > -- Ryan Blue Software Engineer Netflix

Re: Spark Improvement Proposals

2016-10-10 Thread Ryan Blue
(Quoted from earlier messages in the thread.)

Matei Zaharia: One very lightweight idea is to have a new type of JIRA called a SIP and a link to a filter that shows all such JIRAs from http://spark.apache.org. I also like the idea of SIP and design doc templates (in fact many projects have them).

Reynold Xin: I called Cody last night and talked about some of the topics in his email. It became clear to me Cody genuinely cares about the project. Some of the frustrations come from the success of the project itself becoming very "hot", and it is difficult to get clarity from people who don't dedicate all their time to Spark; in some ways it is similar to scaling an engineering team in a successful startup. I would also really like a more visible process for larger changes, especially major user-facing API changes. Historically we upload design docs for major changes, but it is not always consistent, and the quality of the docs is hard to maintain given the volunteer nature of the organization. Some concrete ideas we discussed for building a culture of clarity: large changes should have design docs posted on JIRA, following a project-wide template that explicitly lists goals and non-goals; authors should email dev@ to solicit feedback, because a JIRA alone gets lost in the noise; design doc authors should be open to feedback, while staying comfortable accepting or rejecting ideas on technical grounds; monthly Google Hangouts that are open to the world could help for major ongoing projects; and contributors (including committers) should be more direct in setting expectations, since not hearing anything is often more annoying to a contributor than hearing "no".

Matei Zaharia: Love the idea of a more visible "Spark Improvement Proposal" process that solicits user input on new APIs. I don't think committers are trying to minimize their own work; every committer cares about making the software useful for users, but user input is always hard to get, so a process like this helps. When you talk about "changing interfaces", do you mean public or internal APIs? Many people hate changing public APIs, and I think that's for the best of the project: the worst thing about using a piece of software is being asked to rewrite your app for every new version (cue anyone who's used Protobuf or Guava). The "let's get everyone to change their code this release" model works within a single large company, but not for a community, which is why nearly all very widely used programming interfaces (the Java standard library, the Windows API, etc.) almost never break backwards compatibility. All of this is done within reason, though; we do change things in major releases (2.x, 3.x, etc.).

-- Ryan Blue Software Engineer Netflix

Re: Spark Improvement Proposals

2016-10-10 Thread Ryan Blue
Sorry, I missed that the proposal includes majority approval. Why majority instead of consensus? I think we want to build consensus around these proposals and it makes sense to discuss until no one would veto. rb On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue wrote: > +1 to votes to appr

Re: Spark Improvement Proposals

2016-10-10 Thread Ryan Blue
is better, I > don't care. Again, I don't feel strongly about the way we achieve > clarity, just that we achieve clarity. > > On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue wrote: > > Sorry, I missed that the proposal includes majority approval. Why > majority > >

Re: Spark Improvement Proposals

2016-10-11 Thread Ryan Blue
;>>> reality. Beyond that, if people think it's more open to allow formal > >>>> proposals from anyone, I'm not necessarily against it, but my main > >>>> question would be this: > >>>> > >>>> If anyone can submit a proposa

Re: source for org.spark-project.hive:1.2.1.spark2

2016-10-14 Thread Ryan Blue
38-deee-476a-93ff-92fead06e...@hortonworks.com%3E] > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: source for org.spark-project.hive:1.2.1.spark2

2016-10-17 Thread Ryan Blue
Are these changes that the Hive community has rejected? I don't see a compelling reason to have a long-term Spark fork of Hive. rb On Sat, Oct 15, 2016 at 5:27 AM, Steve Loughran wrote: > > On 15 Oct 2016, at 01:28, Ryan Blue wrote: > > The Spark 2 branch is based o

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Ryan Blue
t; to handle for example Option[Set[Int]], but it really >> cannot handle Set so it leads to a runtime exception. >> >> would it be useful to make this a little more specific? i guess the >> challenge is going to be case classes which unfortunately dont extend >> Product1, Product2, etc. >> > > -- Ryan Blue Software Engineer Netflix
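One workaround sketch for encoder gaps like Set (assuming a SparkSession named spark): fall back to a Kryo-based encoder, which serializes the whole object instead of mapping its fields to columns.

  import org.apache.spark.sql.{Encoder, Encoders}

  // Explicit Kryo encoder for a type the built-in encoders of that era did not cover.
  implicit val setEncoder: Encoder[Set[Int]] = Encoders.kryo[Set[Int]]
  val ds = spark.createDataset(Seq(Set(1, 2), Set(3)))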

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-28 Thread Ryan Blue
version as 2.0.3, rather than 2.0.2. If a new RC > (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2. > > > -- Ryan Blue Software Engineer Netflix

Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Ryan Blue
; >>> > something needs to change. > >>> > > >>> > - We need a clear process for planning significant changes to the > >>> > codebase. > >>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly, > >>> > but you need a do

Re: Updating Parquet dep to 1.9

2016-11-01 Thread Ryan Blue
library dep to 1.9? If not, >> I can at least get started on it and publish a PR. >> >> Cheers, >> >> Michael >> ----- >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- Ryan Blue Software Engineer Netflix

Re: Updating Parquet dep to 1.9

2016-11-01 Thread Ryan Blue
Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue > wrote: > >> 1.9.0 includes some fixes intended specifically for Spark: >> >> * PARQUET-389: Evaluates push-down predicates for missing columns as >> though they are null. This is to address Spark's work-around that requires

Re: Updating Parquet dep to 1.9

2016-11-02 Thread Ryan Blue
github.com/apache/spark/pull/15538 needs to make > it into 2.1. The logging output issue is really bad. I would probably call > it a blocker. > > Michael > > > On Nov 1, 2016, at 1:22 PM, Ryan Blue wrote: > > I can when I'm finished with a couple other issues if

Re: Spark Improvement Proposals

2016-11-08 Thread Ryan Blue
(Quoted from earlier messages in the thread.)

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals ... Spark is no longer an engine that works only for micro-batch and batch. We (and I am sure many others) are pushing Spark as an engine for stream and query processing; we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well.

Tomasz Gawęda: I'm quite late with my answer, but I think my suggestions may help a little bit. Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters". I really like Spark: ease of use, speed, a very good community. But every project has to fight on the "framework market" to stay number one. Why are some people saying that Flink (or another framework) is better? Not because that framework is better in all cases; in my opinion many of those discussions started after marketing-like posts, and StackOverflow "Flink vs ..." threads are almost always "won" by Flink even when the answers don't tell the whole story. My suggestion: some company that supports Spark (Databricks, Cloudera? just saying you're the most visible in the community) could run performance tests of the streaming engine, machine learning models, batch jobs, graph jobs, and SQL queries. People would see that Spark is evolving and is still a modern, fast framework; that would be marketing based on facts (numbers), not opinions. Second, real-time streaming: some work should be done to make Structured Streaming lower latency, and maybe Spark could look at Gearpump, which is also built on top of Akka. It is a good topic for a SIP, but I think Spark should have real-time streaming support, since many posts and comments say Spark's latency is too high. Other people said much more, and I agree with the SIP proposal; I'm also happy that the PMC members are not saying they won't listen to users, but really want to make Spark better for every user.

-- Ryan Blue Software Engineer Netflix

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Ryan Blue
(Quoting the release candidate email.) http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/

Q: How can I help test this release? A: If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release? A: This is a maintenance release in the 2.0.x series. Bugs already present in 2.0.1, missing features, or bugs related to new features will not necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from now on? A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.

-- Regards, Vaquar Khan, +1 224-436-0783, IT Architect / Lead Consultant, Greater Chicago

-- Ryan Blue Software Engineer Netflix

Re: OutOfMemoryError on parquet SnappyDecompressor

2016-11-21 Thread Ryan Blue
(Quoted stack trace from the original report.)
  org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
  org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
  org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
  org.apache.spark.scheduler.Task.run(Task.scala:54)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:722)

-- Ryan Blue Software Engineer Netflix

Re: OutOfMemoryError on parquet SnappyDecompressor

2016-11-21 Thread Ryan Blue
; > Thanks, > Aniket > > On Mon, Nov 21, 2016, 3:24 PM Ryan Blue [via Apache Spark Developers List] > <[hidden email] <http:///user/SendEmail.jtp?type=node&node=19973&i=0>> > wrote: > >> Aniket, >> >> The solution was to add a sort so that o

Re: Forking or upgrading Apache Parquet in Spark

2016-12-15 Thread Ryan Blue
umber 2 because we will use Apache > Parquet 1.8.1 for a while. > > Bests, > Dongjoon. > > ----- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread Ryan Blue
List mailing list archive at > Nabble.com. > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: Apache Hive with Spark Configuration

2017-01-03 Thread Ryan Blue
ll me which > version is more compatible with Spark 2.0.2 ? > > THanks > -- Ryan Blue Software Engineer Netflix

Parquet patch release

2017-01-06 Thread Ryan Blue
1701.mbox/%3CCAO4re1mnWJ3%3Di0NpUmPU%2BwD8G%3DsG_%2BAA2PsFBzZv%3DwrUR1529g%40mail.gmail.com%3E> on the Parquet dev list. If you're interested in reviewing what goes into 1.8.2 or have suggestions, please follow that thread on the Parquet list. Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: [SQL][PYTHON] UDF improvements.

2017-01-09 Thread Ryan Blue
( > https://gist.github.com/zero323/88953975361dbb6afd639b35368a97b4) and > I'll be happy to open a JIRA and submit a PR if there is any interest in > that. > > -- > Best, > Maciej > > -- Ryan Blue Software Engineer Netflix

Re: Is it possible to get a job end kind of notification on the executor (slave)

2017-01-20 Thread Ryan Blue
ssListener but it does not seem to work in a cluster. The > event is not triggered in the worker. > > Regards, > Keith. > > http://keith-chapman.com > -- Ryan Blue Software Engineer Netflix

Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-24 Thread Ryan Blue
at 11:40 AM, Julien Le Dem >> wrote: >> >>> +1 >>> Followed: https://cwiki.apache.org/confluence/display/PARQUET/How+To+V >>> erify+A+Release >>> checked sums, ran the build and tests. >>> We would appreciate someone from the Spark project

Re: Driver hung and happend out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
java.lang.String.(String.java:207) at > java.lang.StringBuilder.toString(StringBuilder.java:407) at > scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:430) > at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:101) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:55) > at java.util.TimerThread.mainLoop(Timer.java:555) at > java.util.TimerThread.run(Timer.java:505) > > -- Ryan Blue Software Engineer Netflix
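Not necessarily the resolution of this thread, but for reference, the component in the stack trace above is controlled by a single setting; a sketch of turning it off so the driver stops building those progress strings:

  import org.apache.spark.SparkConf

  val conf = new SparkConf().set("spark.ui.showConsoleProgress", "false")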

Re: Add hive-site.xml at runtime

2017-02-13 Thread Ryan Blue
ark.apache.org and my mail was bouncing each > time so Sean Owen suggested to mail dev.(https://issues.apache. > org/jira/browse/SPARK-19546). Please give solution to above ticket also > if possible. > > Thanks > > -- > Shivam Sharma > -- Ryan Blue Software Engineer Netflix

Re: Need Help: getting java.lang.OutOfMemory Error : GC overhead limit exceeded (TransportChannelHandler)

2017-02-15 Thread Ryan Blue
354) > at io.netty.util.concurrent.SingleThreadEventExecutor$2. > run(SingleThreadEventExecutor.java:116) > at java.lang.Thread.run(Thread.java:745) > > 2017-02-15 14:50:14,692 WARN > org.apache.spark.network.server.TransportChannelHandler: > Exception in connection from /10.154.16.74:58547 > java.lang.OutOfMemoryError: GC overhead limit exceeded > > Thanks > Naresh > > -- Ryan Blue Software Engineer Netflix

Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
nical >>>>>>> quality of the SPIP: this person need not be a champion for the SPIP or >>>>>>> contribute to it, but rather makes sure it stands a chance of being >>>>>>> approved when the vote happens. Also, if the author cannot find an

Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
(Quoted from Xiao Li earlier in the thread.) Even if some PRs are merged, sometimes we still have to revert them if the design and implementation were not reviewed carefully. We have to ensure our quality. Spark is not an application; it is infrastructure software used by many, many companies, so we have to be very careful in the design and implementation, especially when adding or changing external APIs. When I developed mainframe infrastructure/middleware software over the past six years, I was involved in discussions with external and internal customers. The to-do feature list was always above 100 items. Customers were sometimes frustrated when we could not deliver on time due to resource limits; even if they paid us billions, we still had to deliver phase by phase, or they had to accept workarounds. That is the reality everyone has to face, I think. Thanks, Xiao Li

-- Ryan Blue Software Engineer Netflix

Re: Output Committers for S3

2017-02-20 Thread Ryan Blue
1 is also welcome. > > > > -- > View this message in context: http://apache-spark- > developers-list.1001551.n3.nabble.com/Output-Committers- > for-S3-tp21033.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: Will .count() always trigger an evaluation of each row?

2017-02-20 Thread Ryan Blue
-- Ryan Blue Software Engineer Netflix

Re: Output Committers for S3

2017-02-21 Thread Ryan Blue
On Tue, Feb 21, 2017 at 6:15 AM, Steve Loughran wrote: > On 21 Feb 2017, at 01:00, Ryan Blue wrote: > > You'd have to encode the task ID in the output file name to identify files > > to roll back in the event you need to revert a task, but if you have > > partitione
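An illustrative naming scheme only (not the actual format used by any committer discussed here): embedding the partition and attempt IDs in the file name makes it possible to find and delete the files a failed or reverted task attempt produced.

  def outputFileName(jobId: String, partitionId: Int, attemptNumber: Int): String =
    f"part-$partitionId%05d-$jobId-attempt-$attemptNumber.parquet"

  outputFileName("job_20170221_0001", 3, 0)
  // => "part-00003-job_20170221_0001-attempt-0.parquet"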

Re: Output Committers for S3

2017-02-21 Thread Ryan Blue
-- Ryan Blue Software Engineer Netflix

Re: Spark Improvement Proposals

2017-02-27 Thread Ryan Blue
everything >> depends >> > only on shepherd . >> > >> > Also want to add point that SPIP should be time bound with define SLA >> else >> > will defeats purpose. >> > >> > >> > Regards, >> > Vaquar khan >> > &g

Re: RFC: deprecate SparkStatusTracker, remove JobProgressListener

2017-03-24 Thread Ryan Blue
add a deprecated annotation to it? > > -- > Marcelo > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: Output Committers for S3

2017-03-28 Thread Ryan Blue
(Quoted from the question being answered.) ...etOutputCommitter at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2221) ... 28 more. Can you please point out my mistake? If possible, can you give a working example of saving a DataFrame as a Parquet file in S3?

-- Ryan Blue Software Engineer Netflix

Re: [Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-07 Thread Ryan Blue
pFsRelationCommand.scala:149) > at org.apache.spark.sql.execution.datasources. > InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply( > InsertIntoHadoopFsRelationCommand.scala:115) > > {logs} > > > -- Ryan Blue Software Engineer Netflix

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-10 Thread Ryan Blue
(Quoting the release candidate email.) The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1227/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/

FAQ. How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.1.1? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.0. What happened to RC1? There were issues with the release packaging and as a result it was skipped.

-- Cell: 425-233-8271, Twitter: https://twitter.com/holdenkarau

-- Ryan Blue Software Engineer Netflix

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Ryan Blue
gt;>>>> packaging for RC3 since I'v been poking around in Jenkins a bit (for >>>>> SPARK-20216 >>>>> & friends) (I'd still probably need some guidance from a previous release >>>>> coordinator so I understand if that

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
ve set spark.task.maxFailures to 8 for my job. Seems > like all task retries happen on the same slave in case of failure. My > expectation was that task will be retried on different slave in case of > failure, and chance of all 8 retries to happen on same slave is very less. > > > Regards >

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
stage. In that version, you probably want to set spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs <http://spark.apache.org/docs/latest/configuration.html> and search for “blacklist” to see all the options. rb ​ On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue wrote: > Chawl
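A minimal sketch of the settings mentioned above (values are examples only; the blacklist options exist from Spark 2.1 onward):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.task.maxFailures", "8")                          // total attempts per task
    .set("spark.blacklist.enabled", "true")                      // enable executor blacklisting
    .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1") // force retries onto other executors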

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
(Quoting the release candidate email.) ...corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/

FAQ. How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. What should happen to JIRA tickets still targeting 2.2.0? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.

-- Ryan Blue Software Engineer Netflix

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
ParquetAvroOutputFormat from a application running on Spark 2.2.0. > > Regards, > > Frank Austin Nothaft > fnoth...@berkeley.edu > fnoth...@eecs.berkeley.edu > 202-340-0466 <(202)%20340-0466> > > On May 1, 2017, at 10:02 AM, Ryan Blue > wrote: > > I agree with

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
t; fnoth...@berkeley.edu >> fnoth...@eecs.berkeley.edu >> 202-340-0466 <(202)%20340-0466> >> >> On May 1, 2017, at 11:31 AM, Ryan Blue wrote: >> >> Frank, >> >> The issue you're running into is caused by using parquet-avro with Avro >>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
xpects to > find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead. Spark > already has to work around this for unit tests to pass. > > > > On Mon, May 1, 2017 at 2:00 PM, Ryan Blue wrote: > >> Thanks for the extra context, Frank. I agree that it sounds like you

Re: Parquet vectorized reader DELTA_BYTE_ARRAY

2017-05-22 Thread Ryan Blue
-- Ryan Blue Software Engineer Netflix

Re: Are release docs part of a release?

2017-06-08 Thread Ryan Blue
A (SPARK-20507) are not *actually* critical >> as the project website certainly can be updated separately from the source >> code guide and is not part of the release to be voted on. In future that >> particular work item for the QA process could be marked down in priority, >>

Re: Output Committers for S3

2017-06-19 Thread Ryan Blue
(maybe > ryan or steve can confirm this assumption) not applicable to the Netflix > committer uploaded by Ryan Blue. Because Ryan's committer uses multipart > upload. So either the whole file is live or nothing is. Partial data will > not be available for read. Whatever partial data that

Re: [VOTE] [SPIP] SPARK-18085: Better History Server scalability

2017-08-01 Thread Ryan Blue
't think this is a good idea because of the following > technical reasons. > > Thanks! > > -- > Marcelo > > ----- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ryan Blue
gt;> >> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8 >> >> Do you think below setting can help me to overcome above issue: >> >> spark.default.parellism=1000 >> spark.sql.shuffle.partitions=1000 >> >> Because default max number of partitions are 1000. >> >> >> > -- Ryan Blue Software Engineer Netflix

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Ryan Blue
(Quoted from James earlier in the thread.) ...filter A some of the time, and filter B some of the time. If I'm passed in both, then either A and B are unhandled, or A, or B, or neither. The work I have to do to figure this out is essentially the same as what I do while actually generating my RDD (essentially I have to generate my partitions), so I end up doing some weird caching work. This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan there is essentially one degree of freedom for pruning (filters), so you just have to implement caching between unhandledFilters and buildScan. Here, however, we have many degrees of freedom: sorts, individual filters, clustering, sampling, maybe aggregations eventually; these operations are not all commutative, and computing my support one-by-one can easily end up being more expensive than computing it all in one go. Some trivial examples: after filtering I might be sorted, whilst before filtering I might not be; filtering with certain filters might affect my ability to push down others; filtering with aggregations (as mooted) might not be possible to push down. And with the API as currently mooted, I need to be able to go back and change my results because they might change later. Really what would be good here is to pass all of the filters and sorts at once, and then I return the parts I can't handle. I'd prefer in general that this be implemented by passing some kind of query plan to the data source which enables this kind of replacement; explicitly don't want to give the whole query plan (that sounds painful), but push down only the parts of the query plan we deem to be stable. With the mix-in approach, I don't think we can guarantee the properties we want without a two-phase thing; I'd really love to be able to just define a straightforward union type which is our supported pushdown stuff, and then the user can transform and return it. I think this ends up being a more elegant API for consumers, and also far more intuitive. James

(Later replies in the quoted chain: +1 votes from 蒋星博 and Xiao Li; Cody Koeninger points out that because the JIRA isn't labeled SPIP, it won't show up linked from http://spark.apache.org/improvement-proposals.html; Wenchen Fan's original vote call links the full Data Source API V2 document at https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit and notes that the vote should focus on the high-level design/framework, not specific APIs, and runs for 72 hours.)

-- Ryan Blue Software Engineer Netflix
