ate existing data.
>>> 3. when reading data, fill the missing column with the initial default
>>> value
>>> 4. when writing data, fill the missing column with the latest default
>>> value
>>> 5. when altering a column to change its default va
e is
> decided by the end-users.
>
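As an illustration of the default-value semantics quoted above, here is a hypothetical Python sketch (not any Spark or data source API — `INITIAL_DEFAULT`, `LATEST_DEFAULT`, and the row helpers are made up): rows stored before the column existed read back with the *initial* default, while new writes that omit the column are filled with the *latest* default.

```python
# Hypothetical sketch of the default-value semantics above.
INITIAL_DEFAULT = 0   # default in effect when the column was added
LATEST_DEFAULT = 42   # default after a later ALTER COLUMN ... SET DEFAULT

def read_row(stored):
    # Reading: fill the missing column "c" with the initial default.
    return {"c": stored.get("c", INITIAL_DEFAULT), **stored}

def write_row(row):
    # Writing: fill the missing column "c" with the latest default.
    return {**row, "c": row.get("c", LATEST_DEFAULT)}

print(read_row({"id": 1}))   # {'c': 0, 'id': 1}
print(write_row({"id": 2}))  # {'id': 2, 'c': 42}
```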
> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue wrote:
>
>> Wenchen, can you give more detail about the different ADD COLUMN syntax?
>> That sounds confusing to end users to me.
>>
>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan
with my proposal that we should follow RDBMS/SQL standard
>> regarding the behavior?
>>
>> > pass the default through to the underlying data source
>>
>> This is one way to implement the behavior.
>>
>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue wrote:
>
table schema during writing.
> Users can use native client of data source to change schema.
> >
> > On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue wrote:
> >>
> >> I think it is good to know that not all sources support default values.
> That makes me think that we s
rce supports.
>>> >
>>> > Following this direction, it makes more sense to delegate everything
>>> to data sources.
>>> >
>>> > As the first step, maybe we should not add DDL commands to change
>>> schema of data source, but just use th
g this can speed things up to the tune of 2-6%. Has anyone
>> considered this before?
>>
>> Sean
>>
>> -----
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
--
Ryan Blue
Software Engineer
Netflix
an also talk about the user-facing API
<https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.cgnrs9vys06x>
proposed in the SPIP.
Thanks,
rb
--
Ryan Blue
Software Engineer
Netflix
Here are my notes from the DSv2 sync last night.
*As usual, I didn’t take great notes because I was participating in the
discussion. Feel free to send corrections or clarification.*
*Attendees*:
Ryan Blue
John Zhuge
Xiao Li
Reynold Xin
Felix Cheung
Anton Okolnychyi
Bruce Robbins
Dale Richardson
tables and not nested namespaces. How would Spark handle
arbitrary nesting that differs across catalogs?
Hopefully, I’ve captured the design question well enough for a productive
discussion. Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
support
path-based tables by adding a path to CatalogIdentifier, either as a
namespace or as a separate optional string. Then, the identifier passed to
a catalog would work for either a path-based table or a catalog table,
without needing a path-based catalog API.
Thoughts?
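A rough sketch of the option above — carrying an optional path on the identifier so one type covers both catalog tables and path-based tables. The shape here is hypothetical (these `CatalogIdentifier` fields are illustrative, not Spark's actual API):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class CatalogIdentifier:
    # Hypothetical shape: either namespace + name identify a catalog
    # table, or path identifies a path-based table.
    namespace: Tuple[str, ...] = ()
    name: Optional[str] = None
    path: Optional[str] = None

catalog_table = CatalogIdentifier(namespace=("db",), name="events")
path_table = CatalogIdentifier(path="/data/events")

# Exactly one of the two identification styles is set per identifier.
assert (catalog_table.path is None) != (path_table.path is None)
```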
On Sun, Jan 13, 2019 at 1:
y review.
>>
>> The first PR <https://github.com/apache/spark/pull/23552> does not
>> contain the changes of hive-thriftserver. Please ignore the failed test in
>> hive-thriftserver.
>>
>> The second PR <https://github.com/apache/spark/pull/23553> is complete
>> changes.
>>
>>
>>
>> I have created a Spark distribution for Apache Hadoop 2.7, you might
>> download it via Google Drive
>> <https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or Baidu
>> Pan <https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
>>
>> Please help review and test. Thanks.
>>
>
--
Ryan Blue
Software Engineer
Netflix
;
>> And we are super 100% dependent on Hive...
>>
>>
>> --
>> *From:* Ryan Blue
>> *Sent:* Tuesday, January 15, 2019 9:53 AM
>> *To:* Xiao Li
>> *Cc:* Yuming Wang; dev
>> *Subject:* Re: [DISCUSS] Upgrade built-in Hi
s
> > we're going to stay in 1.2.x for, at least, a long time (say .. until
> Spark 4.0.0?).
> >
> > I know this has somehow become sensitive, but to be completely
> honest, I think we should give it a try.
> >
>
>
> --
> Marcelo
>
--
Ryan Blue
Software Engineer
Netflix
Any discussion on how Spark should manage identifiers when multiple
catalogs are supported?
I know this is an area where a lot of people are interested in making
progress, and it is a blocker for both multi-catalog support and CTAS in
DSv2.
On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue wrote:
>
whole scheme will need to play nice with column identifier as
> well.
>
>
>
>
> --
>
> *From:* Ryan Blue
> *Sent:* Thursday, January 17, 2019 11:38 AM
> *To:* Spark Dev List
> *Subject:* Re: [DISCUSS] Identifiers with multi-catalog support
&
- Ryan: next time, we should talk about the set of metadata proposed
for TableCatalog, but we’re out of time.
*Attendees*:
Ryan Blue
John Zhuge
Reynold Xin
Xiao Li
Dongjoon Hyun
Eric Wohlstadter
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Kevin Yu
Yuanjian Li
Maryann Xue
Matt Cheah
Dale Richardson
timeout. Perhaps
> is the broadcast timeout really meant to be a timeout on
> sparkContext.broadcast, instead of the child.executeCollectIterator()? In
> that case, would it make sense to move the timeout to wrap only
> sparkContext.broadcast?
>
> Best,
>
> Justin
>
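Justin's suggestion could look roughly like this; `execute_collect` and `broadcast` are stand-in functions, not Spark's API, and the point is only where the timeout applies — it wraps the broadcast step alone, not the collect that precedes it.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_collect():
    # Stand-in for child.executeCollectIterator(): may legitimately be slow.
    return [1, 2, 3]

def broadcast(rows):
    # Stand-in for sparkContext.broadcast(rows).
    return ("broadcast", rows)

rows = execute_collect()  # not subject to the timeout
with ThreadPoolExecutor(max_workers=1) as pool:
    # The timeout guards only the broadcast call itself.
    result = pool.submit(broadcast, rows).result(timeout=300)
print(result)  # ('broadcast', [1, 2, 3])
```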
--
Ryan Blue
Software Engineer
Netflix
t ongoing discussion. From the feedback in the
DSv2 sync and on the previous thread, I think it should go quickly.
Thanks for taking a look at the proposal,
rb
--
Ryan Blue
; > Moein
> >
> > --
> >
> > Moein Hosseini
> > Data Engineer
> > mobile: +98 912 468 1859
> > site: www.moein.xyz
> > email: moein...@gmail.com
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
data, and partitioning is already
supported. The idea to use conditions to create separate data frames would
actually make that harder because you'd need to create and name tables for
each one.
On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo wrote:
> Hello Ryan,
>
> On Mon, Feb 4, 2019 at
r.write: " + record.get(0,
> DataTypes.DateType));
>
> }
>
> It prints an integer as output:
>
> MyDataWriter.write: 17039
>
>
> Is this a bug? or I am doing something wrong?
>
> Thanks,
> Shubham
>
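For context on why an integer appears: Spark's InternalRow stores a DateType value as an int counting days since the Unix epoch, so 17039 decodes to a calendar date:

```python
from datetime import date, timedelta

# InternalRow represents DateType as days since 1970-01-01.
days = 17039
print(date(1970, 1, 1) + timedelta(days=days))  # 2016-08-26
```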
--
Ryan Blue
Software Engineer
Netflix
ecause in real cases end users would put more files than just
> stdout and stderr (like gc logs).
>
> SPARK-23155 provides the way to modify log URL but it's only applied to
> SHS, and in Spark UI in running apps it still only shows "stdout" and
> "stderr". SPARK-26792 is for applying this to Spark UI as well, but I've
> got suggestion to just change the default log URL.
>
> Thanks again,
> Jungtaek Lim (HeartSaVioR)
>
--
Ryan Blue
Software Engineer
Netflix
ers have to remove file part manually from URL to
> access list page. Instead of this we may be able to change default URL to
> show all of local logs and let users choose which file to read. (though it
> would be two-clicks to access to actual file)
>
> -Jungtaek Lim (HeartSaVioR)
&
uring debugging. So linking the YARN container log overview
> > page would make much more sense for us. We work it around with a custom
> > submit process that logs all important URLs on the submit side log.
> >
> >
> >
> > On Sat, Feb 9, 2019 at 5:42 AM, Ryan Blue wrote:
&
urning on/off flag option to just get one url or
> default two stdout/stderr urls.
> 3. We could let users enumerate file names they want to link, and create
> log links for each file.
>
> Which one do you suggest?
>
> On Sat, Feb 9, 2019 at 8:24 AM, Ryan Blue wrote:
>
>> Jungtaek
punted to the user. I can understand retaining
> old behavior under a flag where the behavior change could be
> problematic for some users or facilitate migration, but this is just a
> change to some UI links, no? The underlying links don't change.
> On Fri, Feb 8, 2019 at 5:41 PM Ry
Sure. I'll start a thread.
On Mon, Feb 18, 2019 at 6:27 PM Wenchen Fan wrote:
> I think this is the right direction to go. Shall we move forward with a
> vote and detailed designs?
>
> On Mon, Feb 4, 2019 at 9:57 AM Ryan Blue wrote:
>
>> Hi everyone,
>&g
n the next 3 days.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...
Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
:33 AM Maryann Xue
> wrote:
>
>> +1
>>
>> On Mon, Feb 18, 2019 at 10:46 PM John Zhuge wrote:
>>
>>> +1
>>>
>>> On Mon, Feb 18, 2019 at 8:43 PM Dongjoon Hyun
>>> wrote:
>>>
>>>> +1
>>>>
contains sort information, but it isn’t used
because it applies only to single files.
- *Consensus formed not including sorts in v2 table metadata.*
*Attendees*:
Ryan Blue
John Zhuge
Dongjoon Hyun
Felix Cheung
Gengliang Wang
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Matt Cheah
Yifei Huang
Russel
ing 2 years to get
the work done.
Are there any objections to targeting 3.0 for this?
In addition, much of the planning for multi-catalog support has been done
to make v2 possible. Do we also want to include multi-catalog support?
rb
--
Ryan Blue
Software Engineer
Netflix
> people can still plan around when the release branch will likely be cut.
>
> Matei
>
> > On Feb 21, 2019, at 1:03 PM, Ryan Blue
> wrote:
> >
> > Hi everyone,
> >
> > In the DSv2 sync last night, we had a discussion about roadmap and what
> the goal
features that have remained open for the longest time
> and we really need to move forward on these. Putting a target release for
> 3.0 will help in that regard.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue
> *Reply-To: *"rb...@netflix.com"
> *D
lar major release, that's fine -- in fact, we
>>> quite intentionally did not target new features in the Spark 2.0.0 release.
>>> The fact that some entity other than the PMC thinks that Spark 3.0 should
>>> contain certain new features or that it will be costly to them if 3.0 does
>>> not contain those features is not dispositive. If there are public API
>>> changes that should occur in a timely fashion and there is also a list of
>>> new features that some users or contributors want to see in 3.0 but that
>>> look likely to not be ready in a timely fashion, then the PMC should fully
>>> consider releasing 3.0 without all those new features. There is no reason
>>> that they can't come in with 3.1.0.
>>>
>>
--
Ryan Blue
Software Engineer
Netflix
kyLee wrote:
>>
>>> +1
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
--
Ryan Blue
Software Engineer
Netflix
a good idea. What is a problem, or is at least something that I have a
> problem with, are declarative, pseudo-authoritative statements that 3.0 (or
> some other release) will or won't contain some feature, API, etc. or that
> some issue is or is not blocker or worth delaying for. When the PMC has not
> voted on such issues, I'm often left thinking, "Wait... what? Who decided
> that, or where did that decision come from?"
>
>
--
Ryan Blue
Software Engineer
Netflix
cement.
On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah wrote:
> Reynold made a note earlier about a proper Row API that isn’t InternalRow
> – is that still on the table?
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue
> *Reply-To: *"rb...@netflix.com"
>
have to fix that before we declare DSv2 is stable, because
> InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
> >
> > On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote:
> > Will that then require an API break down the line? Do we save that for
>
posal doc
<https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
.
Please vote in the next 3 days.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...
Thanks!
--
Ryan Blue
S
+1 (non-binding)
On Wed, Feb 27, 2019 at 8:34 PM Russell Spitzer
wrote:
> +1 (non-binding)
>
> On Wed, Feb 27, 2019, 6:28 PM Ryan Blue wrote:
>
>> Hi everyone,
>>
>> In the last DSv2 sync, the consensus was that the table metadata SPIP was
>> ready to bri
(e.g.,
INSERT INTO support)
Please vote in the next 3 days on whether you agree with committing to this
goal.
[ ] +1: Agree that we should consider a functional DSv2 implementation a
blocker for Spark 3.0
[ ] +0: . . .
[ ] -1: I disagree with this goal because . . .
Thank you!
--
Ryan Blue
>
> On Thu, Feb 28, 2019 at 1:00 AM Ryan Blue wrote:
>
>> I think that's a good plan. Let's get the functionality done, but mark it
>> experimental pending a new row API.
>>
>> So is there agreement on this set of work, then?
>>
>> On Tue, Feb 2
AM Matt Cheah wrote:
>
>> +1 (non-binding)
>>
>>
>>
>> Are identifiers and namespaces going to be rolled under one of those six
>> points?
>>
>>
>>
>> *From: *Ryan Blue
>> *Reply-To: *"rb...@netflix.com"
>> *Dat
ases is not
> proper project management, IMO.
>
> On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue wrote:
>
>> Mark, if this goal is adopted, "we" is the Apache Spark community.
>>
>> On Thu, Feb 28, 2019 at 9:52 AM Mark Hamstra
>> wrote:
>>
>&g
few PRs in review" issue? you worry that
> we might rush DSv2 at the end to meet a deadline? all the better to,
> if anything, agree it's important now. It's also an agreement to delay
> the release for it, not rush it. I don't see that later is a better
> time to make th
Young-Garner <
> anthony.young-gar...@cloudera.com.invalid> wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Feb 28, 2019 at 5:54 PM John Zhuge wrote:
>>
>> +1 (non-binding)
>>
>> On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah wrote:
>>
>> +1 (non
Actually, I went ahead and removed the confusing section. There is no
public API in the doc now, so that it is clear that it isn't a relevant
part of this vote.
On Fri, Mar 1, 2019 at 4:58 PM Ryan Blue wrote:
> I moved the public API to the "Implementation Sketch" section. Th
This vote fails with the following counts:
3 +1 votes:
- Matt Cheah
- Ryan Blue
- Sean Owen (binding)
1 -0 vote:
- Jose Torres
2 -1 votes:
- Mark Hamstra (binding)
- Mridul Muralidharan (binding)
Thanks for the discussion, everyone. It sounds to me that the main
objection
people to join?
>
> Stavros
>
> On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue
> wrote:
>
>> Here are my notes from the DSv2 sync last night. As always, if you have
>> corrections, please reply with them. And if you’d like to be included on
>> the invite to partic
partitioned using Hive Hash? By
>> understanding, I mean that I’m able to avoid a full shuffle join on Table A
>> (partitioned by Hive Hash) when joining with a Table B that I can shuffle
>> via Hive Hash to Table A.
>>
>>
>>
>> Thank you,
>>
>> Tyson
>>
>
>
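For intuition on the question above: if both tables are bucketed by the same hash function with the same bucket count, equal keys land in the same bucket, so the join can proceed bucket-by-bucket without a full shuffle. A toy sketch (`bucket` is a stand-in; Hive's actual hash function differs, but the property that matters — equal keys map to equal buckets — is the same):

```python
def bucket(key, n=4):
    # Stand-in for Hive's hash; equal keys -> equal buckets.
    return key % n

table_a = [(k, "a%d" % k) for k in range(10)]
table_b = [(k, "b%d" % k) for k in range(10)]

def to_buckets(rows):
    out = {}
    for k, v in rows:
        out.setdefault(bucket(k), []).append((k, v))
    return out

ba, bb = to_buckets(table_a), to_buckets(table_b)

# Join bucket-by-bucket: rows in different buckets can never match,
# so no cross-bucket data movement (shuffle) is needed.
joined = [(ka, va, vb)
          for i in ba
          for ka, va in ba[i]
          for kb, vb in bb.get(i, [])
          if ka == kb]
print(len(joined))  # 10
```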
--
Ryan Blue
Software Engineer
Netflix
> >>> >>>>>>>> do
> >>> >>>>>>>> more than
> >>> >>>>>>>> sending stuff to dev@. One very lightweight idea is to have a
> >>> >>>>>>>> new
> >>> >>>>>>>> type of
> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all
> >>> >>>>>>>> such
> >>> >>>>>>>> JIRAs from
> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and
> design
> >>> >>>>>>>> doc
> >>> >>>>>>>> templates (in fact many projects have them).
> >>> >>>>>>>>
> >>> >>>>>>>> Matei
> >>> >>>>>>>>
> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
> >>> >>>>>>>> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>> I called Cody last night and talked about some of the topics
> in
> >>> >>>>>>>> his
> >>> >>>>>>>> email.
> >>> >>>>>>>> It became clear to me Cody genuinely cares about the project.
> >>> >>>>>>>>
> >>> >>>>>>>> Some of the frustrations come from the success of the project
> >>> >>>>>>>> itself
> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity from
> >>> >>>>>>>> people
> >>> >>>>>>>> who
> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is in some
> >>> >>>>>>>> ways
> >>> >>>>>>>> similar
> >>> >>>>>>>> to scaling an engineering team in a successful startup: old
> >>> >>>>>>>> processes that
> >>> >>>>>>>> worked well might not work so well when it gets to a certain
> >>> >>>>>>>> size,
> >>> >>>>>>>> cultures
> >>> >>>>>>>> can get diluted, building culture vs building process, etc.
> >>> >>>>>>>>
> >>> >>>>>>>> I also really like to have a more visible process for larger
> >>> >>>>>>>> changes,
> >>> >>>>>>>> especially major user facing API changes. Historically we
> upload
> >>> >>>>>>>> design docs
> >>> >>>>>>>> for major changes, but it is not always consistent, and it is
> >>> >>>>>>>> difficult to ensure the quality of the docs, due to the
> >>> >>>>>>>> volunteering nature of the organization.
> >>> >>>>>>>>
> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
> building a
> >>> >>>>>>>> culture
> >>> >>>>>>>> to improve clarity:
> >>> >>>>>>>>
> >>> >>>>>>>> - Process: Large changes should have design docs posted on
> JIRA.
> >>> >>>>>>>> One
> >>> >>>>>>>> thing
> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to me is
> we
> >>> >>>>>>>> should
> >>> >>>>>>>> create a design doc template for the project and ask everybody
> >>> >>>>>>>> to
> >>> >>>>>>>> follow.
> >>> >>>>>>>> The design doc template should also explicitly list goals and
> >>> >>>>>>>> non-goals, to
> >>> >>>>>>>> make design doc more consistent.
> >>> >>>>>>>>
> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have done this
> >>> >>>>>>>> with some
> >>> >>>>>>>> changes, but again very inconsistently. Just posting something
> on
> >>> >>>>>>>> JIRA
> >>> >>>>>>>> isn't
> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and the
> >>> >>>>>>>> signal
> >>> >>>>>>>> gets lost
> >>> >>>>>>>> in the noise. While this is generally impossible to enforce
> >>> >>>>>>>> because
> >>> >>>>>>>> we can't
> >>> >>>>>>>> force all volunteers to conform to a process (or they might
> not
> >>> >>>>>>>> even
> >>> >>>>>>>> be
> >>> >>>>>>>> aware of this), those who are more familiar with the project
> >>> >>>>>>>> can
> >>> >>>>>>>> help by
> >>> >>>>>>>> emailing the dev@ when they see something that hasn't been.
> >>> >>>>>>>>
> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
> feedback.
> >>> >>>>>>>> A
> >>> >>>>>>>> design
> >>> >>>>>>>> doc should serve as the base for discussion and is by no means
> >>> >>>>>>>> the
> >>> >>>>>>>> final
> >>> >>>>>>>> design. Of course, this does not mean the author has to accept
> >>> >>>>>>>> every
> >>> >>>>>>>> feedback. They should also be comfortable accepting /
> rejecting
> >>> >>>>>>>> ideas on
> >>> >>>>>>>> technical grounds.
> >>> >>>>>>>>
> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
> >>> >>>>>>>> useful
> >>> >>>>>>>> to
> >>> >>>>>>>> have
> >>> >>>>>>>> some monthly Google hangouts that are open to the world. I am
> >>> >>>>>>>> actually not
> >>> >>>>>>>> sure how well this will work, because of the volunteering
> nature
> >>> >>>>>>>> and
> >>> >>>>>>>> we need
> >>> >>>>>>>> to adjust for timezones for people across the globe, but it
> >>> >>>>>>>> seems
> >>> >>>>>>>> worth
> >>> >>>>>>>> trying.
> >>> >>>>>>>>
> >>> >>>>>>>> - Culture: Contributors (including committers) should be more
> >>> >>>>>>>> direct
> >>> >>>>>>>> in
> >>> >>>>>>>> setting expectations, including whether they are working on a
> >>> >>>>>>>> specific
> >>> >>>>>>>> issue, whether they will be working on a specific issue, and
> >>> >>>>>>>> whether
> >>> >>>>>>>> an
> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I know in
> >>> >>>>>>>> this
> >>> >>>>>>>> community
> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it is
> >>> >>>>>>>> often
> >>> >>>>>>>> more
> >>> >>>>>>>> annoying to a contributor to not know anything than getting a
> >>> >>>>>>>> no.
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
> >>> >>>>>>>> <[hidden email]>
> >>> >>>>>>>> wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal"
> >>> >>>>>>>>> process that
> >>> >>>>>>>>> solicits user input on new APIs. For what it's worth, I don't
> >>> >>>>>>>>> think
> >>> >>>>>>>>> committers are trying to minimize their own work -- every
> >>> >>>>>>>>> committer
> >>> >>>>>>>>> cares
> >>> >>>>>>>>> about making the software useful for users. However, it is
> >>> >>>>>>>>> always
> >>> >>>>>>>>> hard to
> >>> >>>>>>>>> get user input and so it helps to have this kind of process.
> >>> >>>>>>>>> I've
> >>> >>>>>>>>> certainly
> >>> >>>>>>>>> looked at the *IPs a lot in other software I use just to see
> >>> >>>>>>>>> the
> >>> >>>>>>>>> biggest
> >>> >>>>>>>>> things on the roadmap.
> >>> >>>>>>>>>
> >>> >>>>>>>>> When you're talking about "changing interfaces", are you
> >>> >>>>>>>>> talking
> >>> >>>>>>>>> about
> >>> >>>>>>>>> public or internal APIs? I do think many people hate changing
> >>> >>>>>>>>> public APIs
> >>> >>>>>>>>> and I actually think that's for the best of the project.
> That's
> >>> >>>>>>>>> a
> >>> >>>>>>>>> technical
> >>> >>>>>>>>> debate, but basically, the worst thing when you're using a
> >>> >>>>>>>>> piece
> >>> >>>>>>>>> of
> >>> >>>>>>>>> software
> >>> >>>>>>>>> is that the developers constantly ask you to rewrite your app
> >>> >>>>>>>>> to
> >>> >>>>>>>>> update to a
> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
> anyone
> >>> >>>>>>>>> who's used
> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change their
> >>> >>>>>>>>> code
> >>> >>>>>>>>> this
> >>> >>>>>>>>> release" model works well within a single large company, but
> >>> >>>>>>>>> doesn't work
> >>> >>>>>>>>> well for a community, which is why nearly all *very* widely
> >>> >>>>>>>>> used
> >>> >>>>>>>>> programming
> >>> >>>>>>>>> interfaces (I'm talking things like Java standard library,
> >>> >>>>>>>>> Windows
> >>> >>>>>>>>> API, etc)
> >>> >>>>>>>>> almost *never* break backwards compatibility. All this is
> done
> >>> >>>>>>>>> within reason
> >>> >>>>>>>>> though, e.g. we do change things in major releases (2.x, 3.x,
> >>> >>>>>>>>> etc).
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> --
> >>> >>>>> Stavros Kontopoulos
> >>> >>>>> Senior Software Engineer
> >>> >>>>> Lightbend, Inc.
> >>> >>>>> p: +30 6977967274
> >>> >>>>> e: [hidden email]
> >>> >>>>>
> >>> >>>>>
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >>
> >>>
> >>
> >
> >
> >
> >
> >
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Sorry, I missed that the proposal includes majority approval. Why majority
instead of consensus? I think we want to build consensus around these
proposals and it makes sense to discuss until no one would veto.
rb
On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue wrote:
> +1 to votes to appr
is better, I
> don't care. Again, I don't feel strongly about the way we achieve
> clarity, just that we achieve clarity.
>
> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue wrote:
> > Sorry, I missed that the proposal includes majority approval. Why
> majority
> >
;>>> reality. Beyond that, if people think it's more open to allow formal
> >>>> proposals from anyone, I'm not necessarily against it, but my main
> >>>> question would be this:
> >>>>
> >>>> If anyone can submit a proposa
38-deee-476a-93ff-92fead06e...@hortonworks.com%3E]
>
>
>
--
Ryan Blue
Software Engineer
Netflix
Are these changes that the Hive community has rejected? I don't see a
compelling reason to have a long-term Spark fork of Hive.
rb
On Sat, Oct 15, 2016 at 5:27 AM, Steve Loughran
wrote:
>
> On 15 Oct 2016, at 01:28, Ryan Blue wrote:
>
> The Spark 2 branch is based o
t; to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> Would it be useful to make this a little more specific? I guess the
>> challenge is going to be case classes, which unfortunately don't extend
>> Product1, Product2, etc.
>>
>
>
--
Ryan Blue
Software Engineer
Netflix
version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>
--
Ryan Blue
Software Engineer
Netflix
; >>> > something needs to change.
> >>> >
> >>> > - We need a clear process for planning significant changes to the
> >>> > codebase.
> >>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> >>> > but you need a do
library dep to 1.9? If not,
>> I can at least get started on it and publish a PR.
>>
>> Cheers,
>>
>> Michael
>>
>>
--
Ryan Blue
Software Engineer
Netflix
Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue
> wrote:
>
>> 1.9.0 includes some fixes intended specifically for Spark:
>>
>> * PARQUET-389: Evaluates push-down predicates for missing columns as
>> though they are null. This is to address Spark's work-around that requires
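The PARQUET-389 semantics can be sketched with a toy evaluator (not Parquet's code): a predicate on a column missing from a file's schema is evaluated as though every value were null, so an equality predicate matches nothing while an IS NULL predicate matches everything.

```python
def eval_predicate(row, column, op, value=None):
    # A column missing from the file's schema reads as None (null).
    actual = row.get(column)
    if op == "is_null":
        return actual is None
    if op == "eq":
        # Comparisons with null are never true (SQL three-valued
        # logic, collapsed to False here).
        return actual is not None and actual == value
    raise ValueError("unknown op: %s" % op)

row = {"a": 1}  # file written before column "c" was added
print(eval_predicate(row, "c", "eq", 5))    # False
print(eval_predicate(row, "c", "is_null"))  # True
```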
github.com/apache/spark/pull/15538 needs to make
> it into 2.1. The logging output issue is really bad. I would probably call
> it a blocker.
>
> Michael
>
>
> On Nov 1, 2016, at 1:22 PM, Ryan Blue wrote:
>
> I can when I'm finished with a couple other issues if
t; https://cwiki.apache.org/confluence/display/FLINK/
> Flink+Internals
> >>> >>> >>>
> >>> >>> >>> Spark is no longer an engine that works for micro-batch and
> >>> >>> >>> batch...We
> >>> >>> >>> (and
> >>> >>> >>> I am sure many others) are pushing spark as an engine for
> stream
> >>> >>> >>> and
> >>> >>> >>> query
> >>> >>> >>> processing. We need to make it a state-of-the-art engine for
> >>> >>> >>> high
> >>> >>> >>> speed
> >>> >>> >>> streaming data and user queries as well !
> >>> >>> >>>
> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >>> >>> >>>
> >>> >>> >>> wrote:
> >>> >>> >>>>
> >>> >>> >>>> Hi everyone,
> >>> >>> >>>>
> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may
> >>> >>> >>>> help a
> >>> >>> >>>> little bit. :) Many technical and organizational topics were
> >>> >>> >>>> mentioned,
> >>> >>> >>>> but I want to focus on these negative posts about Spark and
> >>> >>> >>>> about
> >>> >>> >>>> "haters"
> >>> >>> >>>>
> >>> >>> >>>> I really like Spark. Ease of use, speed, very good community -
> >>> >>> >>>> it's
> >>> >>> >>>> everything here. But every project has to "fight" on the
> "framework
> >>> >>> >>>> market"
> >>> >>> >>>> to stay no. 1. I'm following many Spark and Big Data
> >>> >>> >>>> communities,
> >>> >>> >>>> maybe my mail will inspire someone :)
> >>> >>> >>>>
> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time
> to
> >>> >>> >>>> join
> >>> >>> >>>> contributing to Spark) have done an excellent job. So why are some
> >>> >>> >>>> people
> >>> >>> >>>> saying that Flink (or other framework) is better, like it was
> >>> >>> >>>> posted
> >>> >>> >>>> in
> >>> >>> >>>> this mailing list? No, not because that framework is better in
> >>> >>> >>>> all
> >>> >>> >>>> cases. In my opinion, many of these discussions were started
> >>> >>> >>>> after
> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
> "Flink
> >>> >>> >>>> vs
> >>> >>> >>>> "
> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
> >>> >>> >>>> sometimes
> >>> >>> >>>> saying nothing about other frameworks, Flink's users (often
> >>> >>> >>>> PMC's)
> >>> >>> >>>> are
> >>> >>> >>>> just posting same information about real-time streaming, about
> >>> >>> >>>> delta
> >>> >>> >>>> iterations, etc. It looks smart and very often it is marked as
> an
> >>> >>> >>>> answer,
> >>> >>> >>>> even if - in my opinion - the whole truth wasn't told.
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to
> >>> >>> >>>> perform
> >>> >>> >>>> huge
> >>> >>> >>>> performance test. Maybe some company, that supports Spark
> >>> >>> >>>> (Databricks,
> >>> >>> >>>> Cloudera? - just saying you're most visible in community :) )
> >>> >>> >>>> could
> >>> >>> >>>> perform performance test of:
> >>> >>> >>>>
> >>> >>> >>>> - streaming engine - probably Spark will lose because of
> >>> >>> >>>> mini-batch
> >>> >>> >>>> model, however currently the difference should be much lower
> >>> >>> >>>> than in
> >>> >>> >>>> previous versions
> >>> >>> >>>>
> >>> >>> >>>> - Machine Learning models
> >>> >>> >>>>
> >>> >>> >>>> - batch jobs
> >>> >>> >>>>
> >>> >>> >>>> - Graph jobs
> >>> >>> >>>>
> >>> >>> >>>> - SQL queries
> >>> >>> >>>>
> >>> >>> >>>> People will see that Spark is evolving and is also a modern
> >>> >>> >>>> framework,
> >>> >>> >>>> because after reading posts mentioned above people may think
> "it
> >>> >>> >>>> is
> >>> >>> >>>> outdated, future is in framework X".
> >>> >>> >>>>
> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
> >>> >>> >>>> Structured
> >>> >>> >>>> Streaming beats every other framework in terms of ease of use
> >>> >>> >>>> and
> >>> >>> >>>> reliability. Performance tests, done in various environments
> (in
> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
> 20-node
> >>> >>> >>>> cluster), could be also very good marketing stuff to say "hey,
> >>> >>> >>>> you're
> >>> >>> >>>> telling that you're better, but Spark is still faster and is
> >>> >>> >>>> still
> >>> >>> >>>> getting even more fast!". This would be based on facts (just
> >>> >>> >>>> numbers),
> >>> >>> >>>> not opinions. It would be good for companies, for marketing
> >>> >>> >>>> purposes
> >>> >>> >>>> and
> >>> >>> >>>> for every Spark developer
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> Second: real-time streaming. Some time ago I wrote about
> >>> >>> >>>> real-time
> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work
> >>> >>> >>>> should be
> >>> >>> >>>> done to make SSS more low-latency, but I think it's possible.
> >>> >>> >>>> Maybe
> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
> Akka?
> >>> >>> >>>> I
> >>> >>> >>>> don't
> >>> >>> >>>> know yet; it is a good topic for a SIP. However, I think that Spark
> >>> >>> >>>> should
> >>> >>> >>>> have real-time streaming support. Currently I see many
> >>> >>> >>>> posts/comments
> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing
> very
> >>> >>> >>>> good
> >>> >>> >>>> job with micro-batches; however, I think it is possible to add
> >>> >>> >>>> also
> >>> >>> >>>> more
> >>> >>> >>>> real-time processing.
> >>> >>> >>>>
> >>> >>> >>>> Other people said much more and I agree with proposal of SIP.
> >>> >>> >>>> I'm
> >>> >>> >>>> also
> >>> >>> >>>> happy that PMCs are not saying that they will not listen to
> >>> >>> >>>> users,
> >>> >>> >>>> but
> >>> >>> >>>> they really want to make Spark better for every user.
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> What do you think about these two topics? Especially I'm
> looking
> >>> >>> >>>> at
> >>> >>> >>>> Cody
> >>> >>> >>>> (who has started this topic) and PMCs :)
> >>> >>> >>>>
> >>> >>> >>>> Pozdrawiam / Best regards,
> >>> >>> >>>>
> >>> >>> >>>> Tomasz
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>>
> >>> >>
> >>> >
> >>> >
> >
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
p://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>>
>>>>
>>>> Q: How can I help test this release?
>>>> A: If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions from 2.0.1.
>>>>
>>>> Q: What justifies a -1 vote for this release?
>>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>>> present in 2.0.1, missing features, or bugs related to new features will
>>>> not necessarily block this release.
>>>>
>>>> Q: What fix version should I use for patches merging into branch-2.0
>>>> from now on?
>>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 224-436-0783
>>>>
>>>> IT Architect / Lead Consultant
>>>> Greater Chicago
>>>>
>>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
>> > >> > org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
>> > >> > org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
>> > >> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> > >> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> > >> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> > >> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> > >> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> > >> > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>> > >> > org.apache.spark.scheduler.Task.run(Task.scala:54)
>> > >> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>> > >> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> > >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> > >> > java.lang.Thread.run(Thread.java:722)
>> > >> >
>> > >> >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>>
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>> http://apache-spark-developers-list.1001551.n3.
>> nabble.com/OutOfMemoryError-on-parquet-SnappyDecompressor-
>> tp8517p8528.html
>>
>
> --
> View this message in context: Re: OutOfMemoryError on parquet
> SnappyDecompressor
> <http://apache-spark-developers-list.1001551.n3.nabble.com/OutOfMemoryError-on-parquet-SnappyDecompressor-tp8517p19965.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>
--
Ryan Blue
Software Engineer
Netflix
>
> Thanks,
> Aniket
>
> On Mon, Nov 21, 2016, 3:24 PM Ryan Blue [via Apache Spark Developers List]
> <[hidden email]>
> wrote:
>
>> Aniket,
>>
>> The solution was to add a sort so that o
umber 2 because we will use Apache
> Parquet 1.8.1 for a while.
>
> Bests,
> Dongjoon.
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
ll me which
> version is more compatible with Spark 2.0.2?
>
> Thanks
>
--
Ryan Blue
Software Engineer
Netflix
1701.mbox/%3CCAO4re1mnWJ3%3Di0NpUmPU%2BwD8G%3DsG_%2BAA2PsFBzZv%3DwrUR1529g%40mail.gmail.com%3E>
on the Parquet dev list. If you're interested in reviewing what goes into
1.8.2 or have suggestions, please follow that thread on the Parquet list.
Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
(
> https://gist.github.com/zero323/88953975361dbb6afd639b35368a97b4) and
> I'll be happy to open a JIRA and submit a PR if there is any interest in
> that.
>
> --
> Best,
> Maciej
>
>
--
Ryan Blue
Software Engineer
Netflix
ssListener but it does not seem to work in a cluster. The
> event is not triggered in the worker.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
--
Ryan Blue
Software Engineer
Netflix
at 11:40 AM, Julien Le Dem
>> wrote:
>>
>>> +1
>>> Followed: https://cwiki.apache.org/confluence/display/PARQUET/How+To+V
>>> erify+A+Release
>>> checked sums, ran the build and tests.
>>> We would appreciate someone from the Spark project
java.lang.String.<init>(String.java:207) at
> java.lang.StringBuilder.toString(StringBuilder.java:407) at
> scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:430)
> at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:101)
> at
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71)
> at
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:55)
> at java.util.TimerThread.mainLoop(Timer.java:555) at
> java.util.TimerThread.run(Timer.java:505)
>
>
--
Ryan Blue
Software Engineer
Netflix
ark.apache.org and my mail was bouncing each
> time so Sean Owen suggested to mail dev.(https://issues.apache.
> org/jira/browse/SPARK-19546). Please give solution to above ticket also
> if possible.
>
> Thanks
>
> --
> Shivam Sharma
>
--
Ryan Blue
Software Engineer
Netflix
354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:116)
> at java.lang.Thread.run(Thread.java:745)
>
> 2017-02-15 14:50:14,692 WARN
> org.apache.spark.network.server.TransportChannelHandler:
> Exception in connection from /10.154.16.74:58547
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Thanks
> Naresh
>
>
--
Ryan Blue
Software Engineer
Netflix
nical
>>>>>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>>>>>> contribute to it, but rather makes sure it stands a chance of being
>>>>>>> approved when the vote happens. Also, if the author cannot find an
> >>>>>> Even if some PRs are merged, sometimes, we still have to revert them
> >>>>>> back, if the design and implementation are not reviewed carefully.
> We have
> >>>>>> to ensure our quality. Spark is not application software. It is
> >>>>>> infrastructure software that is being used by many, many companies.
> We have
> >>>>>> to be very careful in the design and implementation, especially
> >>>>>> adding/changing the external APIs.
> >>>>>>
> >>>>>>
> >>>>>> When I developed the Mainframe infrastructure/middleware software in
> >>>>>> the past 6 years, I was involved in the discussions with
> external/internal
> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
> the
> >>>>>> customers felt frustrated when we were unable to deliver on time
> >>>>>> due to the resource limits and others. Even if they paid us
> billions, we
> >>>>>> still need to do it phase by phase or sometimes they have to accept
> the
> >>>>>> workarounds. That is the reality everyone has to face, I think.
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> Xiao Li
> >>>>>>>
> >>>>>>>
> >>
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
1 is also welcome.
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Output-Committers-
> for-S3-tp21033.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
>
> --
> View this message in context: RE: Will .count() always trigger an
> evaluation of each row?
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Will-count-always-trigger-an-evaluation-of-each-row-tp21018p21027.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>
--
Ryan Blue
Software Engineer
Netflix
On Tue, Feb 21, 2017 at 6:15 AM, Steve Loughran wrote:
> On 21 Feb 2017, at 01:00, Ryan Blue wrote:
> > You'd have to encode the task ID in the output file name to identify files
> > to roll back in the event you need to revert a task, but if you have
> > partitione
1.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
--
Ryan Blue
Software Engineer
Netflix
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
everything
>> > depends only on the shepherd.
>> >
>> > I also want to add the point that a SPIP should be time-bound with a
>> > defined SLA, else it will defeat its purpose.
>> >
>> >
>> > Regards,
>> > Vaquar khan
>> >
&g
add a deprecated annotation to it?
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
etOutputCommitter
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2221)
> >... 28 more
> >
> > can you please point out my mistake.
> >
> > If possible can you give a working example of saving a dataframe as a
> > parquet file in s3.
> >
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Output-Committers-
> for-S3-tp21033p21246.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
pFsRelationCommand.scala:149)
> at org.apache.spark.sql.execution.datasources.
> InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(
> InsertIntoHadoopFsRelationCommand.scala:115)
>
> {logs}
>
>
>
--
Ryan Blue
Software Engineer
Netflix
gt;>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>> spark-1227/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.
>>>>> 1-rc2-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>>>>
>>>>> *But my bug isn't fixed!??!*
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.1.0.
>>>>>
>>>>> *What happened to RC1?*
>>>>>
>>>>> There were issues with the release packaging and as a result was
>>>>> skipped.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Cell : 425-233-8271
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
>>>>> packaging for RC3 since I've been poking around in Jenkins a bit (for
>>>>> SPARK-20216
>>>>> & friends) (I'd still probably need some guidance from a previous release
>>>>> coordinator so I understand if that
ve set spark.task.maxFailures to 8 for my job. Seems
> like all task retries happen on the same slave in case of failure. My
> expectation was that the task would be retried on a different slave, and
> the chance of all 8 retries happening on the same slave is very low.
>
>
> Regards
>
stage. In that version, you probably want to set
spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs
<http://spark.apache.org/docs/latest/configuration.html> and search for
“blacklist” to see all the options.
rb
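The settings mentioned above can be collected as a sketch. This is a minimal, hedged example of how they might be passed via `--conf` or `SparkConf.set` on Spark 2.1+; the numeric values (and the inclusion of `spark.blacklist.enabled`, which the thread does not mention explicitly) are illustrative assumptions, not tuned recommendations.

```scala
// Hedged sketch: retry/blacklist settings from the discussion above,
// gathered as a plain map. Values are illustrative, not recommendations.
object RetryConf {
  val conf: Map[String, String] = Map(
    // Total attempts before the job fails (the poster's setting).
    "spark.task.maxFailures" -> "8",
    // Spark 2.1+ blacklisting: stop retrying a failed task on the same
    // executor/node so attempts spread across the cluster (assumed here).
    "spark.blacklist.enabled" -> "true",
    "spark.blacklist.task.maxTaskAttemptsPerExecutor" -> "1",
    "spark.blacklist.task.maxTaskAttemptsPerNode" -> "2"
  )

  // Render as spark-submit arguments, sorted for stable output.
  def asSubmitArgs: Seq[String] =
    conf.toSeq.sortBy(_._1).map { case (k, v) => s"--conf $k=$v" }
}
```

Each entry maps directly onto a `--conf key=value` pair on the `spark-submit` command line.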
On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue wrote:
> Chawl
orresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>>
>>>>
>>>> *FAQ*
>>>>
>>>> *How can I help test this release?*
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>
>>>> *But my bug isn't fixed!??!*
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from 2.1.1.
>>>>
>>>
>>
--
Ryan Blue
Software Engineer
Netflix
ParquetAvroOutputFormat from a application running on Spark 2.2.0.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On May 1, 2017, at 10:02 AM, Ryan Blue wrote:
>
> I agree with
t; fnoth...@berkeley.edu
>> fnoth...@eecs.berkeley.edu
>> 202-340-0466
>>
>> On May 1, 2017, at 11:31 AM, Ryan Blue wrote:
>>
>> Frank,
>>
>> The issue you're running into is caused by using parquet-avro with Avro
>>
xpects to
> find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead. Spark
> already has to work around this for unit tests to pass.
>
>
>
> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue wrote:
>
>> Thanks for the extra context, Frank. I agree that it sounds like you
1.n3.nabble.com/Parquet-vectorized-
> reader-DELTA-BYTE-ARRAY-tp21538.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
A (SPARK-20507) are not *actually* critical
>> as the project website certainly can be updated separately from the source
>> code guide and is not part of the release to be voted on. In future that
>> particular work item for the QA process could be marked down in priority,
>>
(maybe
> Ryan or Steve can confirm this assumption) not applicable to the Netflix
> committer uploaded by Ryan Blue. Because Ryan's committer uses multipart
> upload, either the whole file is live or nothing is; partial data will
> not be available for read. Whatever partial data that
't think this is a good idea because of the following
> technical reasons.
>
> Thanks!
>
> --
> Marcelo
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
>>
>> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8
>>
>> Do you think below setting can help me to overcome above issue:
>>
>> spark.default.parallelism=1000
>> spark.sql.shuffle.partitions=1000
>>
>> Because the default max number of partitions is 1000.
>>
>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
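For reference, the two settings quoted above control different code paths, which a minimal sketch can make explicit. The value 1000 is kept from the poster's message purely for illustration; the only default asserted here is that `spark.sql.shuffle.partitions` ships as 200 in Spark 2.x.

```scala
// Hedged sketch: the two parallelism settings from the thread above.
// spark.sql.shuffle.partitions (default 200) applies to DataFrame/SQL
// shuffles; spark.default.parallelism applies to RDD operations such as
// reduceByKey. The value 1000 is the poster's example, not advice.
object ShuffleTuning {
  val tuned: Map[String, String] = Map(
    "spark.sql.shuffle.partitions" -> "1000", // DataFrame/SQL shuffles
    "spark.default.parallelism"    -> "1000"  // RDD-level operations
  )
}
```

Raising both together is the usual move when a job is bottlenecked on a few oversized shuffle partitions, since each setting only covers one of the two execution paths.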
;>>>>> some of the time, and filter B some of the time. If I’m passed in both,
>>>>>>> then either A and B are unhandled, or A, or B, or neither. The work I
>>>>>>> have
>>>>>>> to do to work this out is essentially the same as I have to do while
>>>>>>> actually generating my RDD (essentially I have to generate my
>>>>>>> partitions),
>>>>>>> so I end up doing some weird caching work.
>>>>>>>
>>>>>>> This V2 API proposal has the same issues, but perhaps more so. In
>>>>>>> PrunedFilteredScan, there is essentially one degree of freedom for
>>>>>>> pruning
>>>>>>> (filters), so you just have to implement caching between
>>>>>>> unhandledFilters
>>>>>>> and buildScan. However, here we have many degrees of freedom; sorts,
>>>>>>> individual filters, clustering, sampling, maybe aggregations eventually
>>>>>>> -
>>>>>>> and these operations are not all commutative, and computing my support
>>>>>>> one-by-one can easily end up being more expensive than computing all in
>>>>>>> one
>>>>>>> go.
>>>>>>>
>>>>>>> For some trivial examples:
>>>>>>>
>>>>>>> - After filtering, I might be sorted, whilst before filtering I
>>>>>>> might not be.
>>>>>>>
>>>>>>> - Filtering with certain filters might affect my ability to push
>>>>>>> down others.
>>>>>>>
>>>>>>> - Filtering with aggregations (as mooted) might not be possible to
>>>>>>> push down.
>>>>>>>
>>>>>>> And with the API as currently mooted, I need to be able to go back
>>>>>>> and change my results because they might change later.
>>>>>>>
>>>>>>> Really what would be good here is to pass all of the filters and
>>>>>>> sorts etc all at once, and then I return the parts I can’t handle.
>>>>>>>
>>>>>>> I’d prefer in general that this be implemented by passing some kind
>>>>>>> of query plan to the datasource which enables this kind of replacement.
>>>>>>> Explicitly don’t want to give the whole query plan - that sounds
>>>>>>> painful -
>>>>>>> would prefer we push down only the parts of the query plan we deem to be
>>>>>>> stable. With the mix-in approach, I don’t think we can guarantee the
>>>>>>> properties we want without a two-phase thing - I’d really love to be
>>>>>>> able
>>>>>>> to just define a straightforward union type which is our supported
>>>>>>> pushdown
>>>>>>> stuff, and then the user can transform and return it.
>>>>>>>
>>>>>>> I think this ends up being a more elegant API for consumers, and
>>>>>>> also far more intuitive.
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> On Mon, 28 Aug 2017 at 18:00 蒋星博 wrote:
>>>>>>>
>>>>>>>> +1 (Non-binding)
>>>>>>>>
>>>>>>>> Xiao Li wrote on Monday, Aug 28, 2017 at 5:38 PM:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> 2017-08-28 12:45 GMT-07:00 Cody Koeninger :
>>>>>>>>>
>>>>>>>>>> Just wanted to point out that because the jira isn't labeled
>>>>>>>>>> SPIP, it
>>>>>>>>>> won't have shown up linked from
>>>>>>>>>>
>>>>>>>>>> http://spark.apache.org/improvement-proposals.html
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan
>>>>>>>>>> wrote:
>>>>>>>>>> > Hi all,
>>>>>>>>>> >
>>>>>>>>>> > It has been almost 2 weeks since I proposed the data source V2
>>>>>>>>>> for
>>>>>>>>>> > discussion, and we already got some feedbacks on the JIRA
>>>>>>>>>> ticket and the
>>>>>>>>>> > prototype PR, so I'd like to call for a vote.
>>>>>>>>>> >
>>>>>>>>>> > The full document of the Data Source API V2 is:
>>>>>>>>>> > https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-
>>>>>>>>>> Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>>>>> >
>>>>>>>>>> > Note that, this vote should focus on high-level
>>>>>>>>>> design/framework, not
>>>>>>>>>> > specified APIs, as we can always change/improve specified APIs
>>>>>>>>>> during
>>>>>>>>>> > development.
>>>>>>>>>> >
>>>>>>>>>> > The vote will be up for the next 72 hours. Please reply with
>>>>>>>>>> your vote:
>>>>>>>>>> >
>>>>>>>>>> > +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>>>> > +0: Don't really care.
>>>>>>>>>> > -1: I don't think this is a good idea because of the following
>>>>>>>>>> technical
>>>>>>>>>> > reasons.
>>>>>>>>>> >
>>>>>>>>>> > Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>
>>
--
Ryan Blue
Software Engineer
Netflix
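The "pass everything at once, return what you can't handle" pattern James argues for can be sketched without Spark dependencies. The `Filter` ADT and `Relation` trait below are simplified stand-ins, not Spark's actual `org.apache.spark.sql.sources` API; the partition-column source is a hypothetical example.

```scala
// Hedged sketch of whole-set filter pushdown: the source sees all
// candidate filters in one call and returns the subset it cannot
// evaluate, so it can reason about interactions between them.
sealed trait Filter
case class EqualTo(col: String, value: Any) extends Filter
case class GreaterThan(col: String, value: Any) extends Filter

trait Relation {
  // Filters returned here are re-applied by the engine after the scan.
  def unhandledFilters(filters: Seq[Filter]): Seq[Filter]
}

// Hypothetical source that can only push down equality on partition columns.
class PartitionedSource(partitionCols: Set[String]) extends Relation {
  def unhandledFilters(filters: Seq[Filter]): Seq[Filter] =
    filters.filterNot {
      case EqualTo(col, _) => partitionCols.contains(col)
      case _               => false
    }
}

object PushdownDemo {
  def main(args: Array[String]): Unit = {
    val src = new PartitionedSource(Set("date"))
    val remaining = src.unhandledFilters(
      Seq(EqualTo("date", "2017-08-28"), GreaterThan("clicks", 10)))
    println(remaining) // the GreaterThan filter stays with the engine
  }
}
```

The key design point is that `unhandledFilters` receives the full filter set in a single call, so a source can decide its support set holistically instead of answering for each filter one by one.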