to get any remaining discussion going or get anyone that
missed this to read through the docs.
Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
raits go away. And the ORC data source can also be simplified
>> to
>>
>> class OrcReaderFactory(...) extends DataReaderFactory {
>> def createUnsafeRowReader ...
>>
>> def createColumnarBatchReader ...
>> }
>>
>> class OrcDataSourceReader extends DataSourceReader {
>> def createReadFactories = ... // logic to prepare the parameters and
>> create factories
>> }
>>
>> We also have a potential benefit of supporting hybrid storage data
>> source, which may keep real-time data in row format, and history data in
>> columnar format. Then they can make some DataReaderFactory output
>> InternalRow and some output ColumnarBatch.
>>
>> Thoughts?
>>
>
>
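That hybrid scan could be sketched as follows; the types here are simplified, self-contained stand-ins, not the actual DataSourceV2 interfaces:

```scala
// Stand-ins for the v2 interfaces: each factory advertises the format it
// produces, so one scan can mix row-based and columnar partitions.
sealed trait Batch
case class RowBatch(rows: Seq[Seq[Any]]) extends Batch   // real-time data
case class ColBatch(cols: Seq[Seq[Any]]) extends Batch   // history data

trait DataReaderFactory {
  def supportsColumnarReads: Boolean
  def read(): Batch
}

class RowPartition(data: Seq[Seq[Any]]) extends DataReaderFactory {
  def supportsColumnarReads = false
  def read(): Batch = RowBatch(data)
}

class ColumnarPartition(data: Seq[Seq[Any]]) extends DataReaderFactory {
  def supportsColumnarReads = true
  def read(): Batch = ColBatch(data)
}

// Planning happens per partition, so historical partitions can come back
// columnar while the most recent partition stays row-oriented.
object HybridScan {
  def planPartitions(): Seq[DataReaderFactory] = Seq(
    new ColumnarPartition(Seq(Seq(1, 2), Seq("a", "b"))),
    new RowPartition(Seq(Seq(3, "c")))
  )
}
```

The point is only that the row/columnar decision moves to the factory, so a single scan can return both kinds.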
--
Ryan Blue
Software Engineer
Netflix
My guess is that we wouldn't want to upgrade to a new minor version of
> Parquet for a Spark maintenance release, so asking for a Parquet
> maintenance release makes sense.
>
> What does everyone think?
>
> Best,
> Henry
>
--
Ryan Blue
Software Engineer
Netflix
Ted Yu <yuzhih...@gmail.com> wrote:
>
>> +1
>>
>> ---- Original message
>> From: Ryan Blue <rb...@netflix.com>
>> Date: 3/30/18 2:28 PM (GMT-08:00)
>> To: Patrick Woody <patrick.woo...@gmail.com>
>> Cc: Russell Spitzer <
ping that through the CBO effort we will continue to
>>>> get more detailed statistics. Like on read we could be using sketch data
>>>> structures to get estimates on unique values and density for each column.
>>>> You may be right that the real way for this to be handl
izer which can decide which method to
>> use rather than having the data source itself do it. This is probably in a
>> far future version of the api.
>>
>> On Thu, Mar 29, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Cassandra can in
jEoo/edit?usp=sharing>
.
Comments and feedback are welcome! Feel free to comment on the doc or reply
to this thread.
rb
--
Ryan Blue
Software Engineer
Netflix
rhead.
>
> For the second, I wouldn't assume that a data source requiring a certain
> write format would give any guarantees around reading the same data? In the
> cases where it is a complete overwrite it would, but for independent writes
> it could still be useful for statistics or c
t; On Tue, Mar 27, 2018 at 7:59 PM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Thanks for the clarification, definitely would want to require Sort but
>> only recommend partitioning ... I think that would be useful to request
>> based on details about the inc
n a while, but does Clustering support allow
> requesting that partitions contain elements in order as well? That would be
> a useful trick for me. IE
> Request/Require(SortedOn(Col1))
> Partition 1 -> ((A,1), (A, 2), (B,1) , (B,2) , (C,1) , (C,2))
>
> On Tue, Mar 27,
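Russell's layout above, clustered by Col1 and also sorted within each partition, can be expressed as two checkable properties; this is a self-contained sketch, and ClusteredBy/SortedOn are illustrative names rather than real API:

```scala
// Partition 1 from the example: clustered by the first column, sorted on it.
val partition1 = Seq(("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1), ("C", 2))
val partition2 = Seq(("D", 1), ("D", 2), ("E", 1))
val partitions = Seq(partition1, partition2)

// Clustered: every key's rows live in exactly one partition (no key overlap).
def clustered[K, V](parts: Seq[Seq[(K, V)]]): Boolean = {
  val keySets = parts.map(_.map(_._1).toSet)
  keySets.combinations(2).forall { case Seq(a, b) => (a intersect b).isEmpty }
}

// Sorted within each partition: the extra guarantee SortedOn(Col1) would add.
def sortedWithin[V](parts: Seq[Seq[(String, V)]]): Boolean =
  parts.forall(p => p.map(_._1) == p.map(_._1).sorted)
```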
ific one called
> HashClusteredDistribution.
>
> So currently only Aggregate can benefit from SupportsReportPartitioning
> and save shuffle. We can add a new interface to expose the hash function to
> make it work for Join.
>
> On Tue, Mar 27, 2018 at 9:33 AM, Ryan Blue <rb...@netf
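Exposing the source's hash function so a join can skip the shuffle might look roughly like this sketch; the interface and method names are illustrative, not part of the actual API:

```scala
// A source reports how it hash-partitions its data.
trait ReportsHashPartitioning {
  def numPartitions: Int
  def partitionIdFor(key: Any): Int
}

class HashedSource(val numPartitions: Int) extends ReportsHashPartitioning {
  def partitionIdFor(key: Any): Int =
    Math.floorMod(key.hashCode, numPartitions)
}

// The planner can skip the join shuffle only if both sides route every key
// to the same partition id.
def canSkipShuffle(left: ReportsHashPartitioning,
                   right: ReportsHashPartitioning,
                   sampleKeys: Seq[Any]): Boolean =
  left.numPartitions == right.numPartitions &&
    sampleKeys.forall(k => left.partitionIdFor(k) == right.partitionIdFor(k))
```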
gt;> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Interesting.
>>>>
>>>> Should requiredClustering return a Set of Expressions?
>>>> This way, we can determine the order of Expressions by looking at
an determine the order of Expressions by looking at what
>> requiredOrdering()
>> returns.
>>
>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi Pat,
>>>
>>> Thanks for starting the discussion on thi
validation/assumptions of the table before attempting the write.
>
> Thanks!
> Pat
>
--
Ryan Blue
Software Engineer
Netflix
usual API, it's not possible (or
> difficult) to create custom structured streaming sources.
>
>
>
> Consequently, one has to create streaming sources in packages under
> org.apache.spark.sql.
>
>
>
> Any pointers or info is greatly appreciated.
>
--
Ryan Blue
Software Engineer
Netflix
save using DataFrameWriter, resulting 512k-block-size
>
> df_txt.write.mode('overwrite').format('parquet').save('hdfs:
> //spark1/tmp/temp_with_df')
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
he format of the SHA512 hash, can we add
> a SHA256 hash to our releases in this format?
>
> I suppose if it’s not easy to update or add hashes to our existing
> releases, it may be too difficult to change anything here. But I’m not
> sure, so I thought I’d ask.
>
> Nick
>
>
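For reference, producing both digests in the usual coreutils format is a one-liner each; the artifact name below is just a stand-in:

```shell
# Stand-in artifact; a real release would hash the actual .tgz.
echo "spark release artifact" > spark-x.y.z-bin.tgz

# GNU coreutils format: "<hex digest>  <filename>"
sha256sum spark-x.y.z-bin.tgz > spark-x.y.z-bin.tgz.sha256
sha512sum spark-x.y.z-bin.tgz > spark-x.y.z-bin.tgz.sha512

# Downstream verification is then:
sha256sum -c spark-x.y.z-bin.tgz.sha256
```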
--
Ryan Blue
Software Engineer
Netflix
g.apache.spark.scheduler.DAGScheduler#createResultStage
>>
>>
>>
>> I can see the effect of doing this may be that Job Submissions may not be
>> FIFO depending on how much time Step 1 mentioned above is going to consume.
>>
>>
>>
>> Does above solution suffice for the problem described? And is there any
>> other side effect of this solution?
>>
>>
>>
>> Regards
>>
>> Ajith
>>
>
>
--
Ryan Blue
Software Engineer
Netflix
te set of those high-level logical
operations, most of which are already defined in SQL or implemented by some
write path in Spark.
rb
--
Ryan Blue
Software Engineer
Netflix
gt;> Please see https://s.apache.org/oXKi. At the time of writing,
>>>>>>>>> there are currently no known release blockers.
>>>>>>>>>
>>>>>>>>> =
>>>>>>>>> How can I help test this release?
>>>>>>>>> =
>>>>>>>>>
>>>>>>>>> If you are a Spark user, you can help us test this release by
>>>>>>>>> taking an existing Spark workload and running on this release
>>>>>>>>> candidate,
>>>>>>>>> then reporting any regressions.
>>>>>>>>>
>>>>>>>>> If you're working in PySpark you can set up a virtual env and
>>>>>>>>> install the current RC and see if anything important breaks, in the
>>>>>>>>> Java/Scala you can add the staging repository to your project's
>>>>>>>>> resolvers
>>>>>>>>> and test with the RC (make sure to clean up the artifact cache
>>>>>>>>> before/after
>>>>>>>>> so you don't end up building with an out-of-date RC going forward).
>>>>>>>>>
>>>>>>>>> ===
>>>>>>>>> What should happen to JIRA tickets still targeting 2.3.0?
>>>>>>>>> ===
>>>>>>>>>
>>>>>>>>> Committers should look at those and triage. Extremely important
>>>>>>>>> bug fixes, documentation, and API tweaks that impact compatibility
>>>>>>>>> should
>>>>>>>>> be worked on immediately. Everything else please retarget to 2.3.1 or
>>>>>>>>> 2.4.0
>>>>>>>>> as appropriate.
>>>>>>>>>
>>>>>>>>> ===
>>>>>>>>> Why is my bug not fixed?
>>>>>>>>> ===
>>>>>>>>>
>>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>>> release unless the bug in question is a regression from 2.2.0. That
>>>>>>>>> being
>>>>>>>>> said, if there is something which is a regression from 2.2.0 and has
>>>>>>>>> not
>>>>>>>>> been correctly targeted please ping me or a committer to help target
>>>>>>>>> the
>>>>>>>>> issue (you can see the open issues listed as impacting Spark 2.3.0 at
>>>>>>>>> https://s.apache.org/WmoI).
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
we should learn from, that we should work
> on stuff we want in the release before the RC, instead of after.
>
> On Thu, Feb 22, 2018 at 1:01 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> What does everyone think about getting some of the newer DataSourceV2
&g
er shuffle services, but it's pretty safe.
>> >>
>> >> On Tue, Feb 20, 2018 at 5:58 PM, Sameer Agarwal <samee...@apache.org>
>> >> wrote:
>> >> > This RC has failed due to
>> >> > https://issues.apache.org/jira/browse/SPARK-23470.
>&g
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sameer Agarwal
>>>>>>>> Computer Science | UC Berkeley
>>>>>>>> http://cs.berkeley.edu/~sameerag
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sameer Agarwal
>>>>>>> Computer Science | UC Berkeley
>>>>>>> http://cs.berkeley.edu/~sameerag
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
on in FeatureHasher before
>> FeatureHasher is released in 2.3.0.
>>
>> https://issues.apache.org/jira/browse/SPARK-23381
>> https://github.com/apache/spark/pull/20568
>>
>> I will fix it soon.
>>
>>
>>
s not going to fix the problem —not if this really is corrupted local
> HDD data
>
--
Ryan Blue
Software Engineer
Netflix
?
A transaction made from a delete and an insert would work. Is this what
we want to use? How do we add this to v2?
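As a purely hypothetical sketch of the semantics (not a proposed interface), the key property is that the delete and the insert become visible together:

```scala
case class Record(part: String, value: Int)

// Toy table where an overwrite is one transaction: delete rows matching a
// predicate, insert the replacements, and commit both in a single step so
// readers never observe the delete without the insert.
class Table(initial: Vector[Record]) {
  private var committed: Vector[Record] = initial
  def data: Vector[Record] = committed

  def replaceWhere(pred: Record => Boolean, inserts: Seq[Record]): Unit = {
    val staged = committed.filterNot(pred) ++ inserts  // stage delete + insert
    committed = staged                                 // single atomic swap
  }
}
```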
rb
--
Ryan Blue
Software Engineer
Netflix
r
>> and easier maintenance.
>>
> Context aside, I really like these rules! I think having query planning be
> the boundary for specialization makes a lot of sense.
>
> (RunnableCommand might also be my fault though sorry! :P)
>
--
Ryan Blue
Software Engineer
Netflix
aware of the issue
> within a month, and we certainly don’t run as large a data infrastructure
> as Netflix does.
>
>
>
> I will keep an eye on this issue.
>
>
>
> Thanks,
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-
file. The issue seems to impact only one column, and is
> very hard to detect. It seems you have encountered this issue before; what
> do you do to prevent a recurrence?
>
>
>
> Thanks,
>
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-T
If we see the _SUCCESS file, does that suggest all data is
> good?
>
> How can we prevent a recurrence? Can you share your experience?
>
>
>
> Thanks,
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rb...@netflix.com&qu
;
> Dong
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ces are between:
> - Spark
> - A (The?) metastore
> - A data source
>
> If we pass in the table identifier is the data source then responsible for
> talking directly to the metastore? Is that what we want? (I'm not sure)
>
> On Fri, Feb 2, 2018 at 10:39 AM, Ryan Blue <
that need to be
in sync with the same convention. On the other hand, passing TableIdentifier
to DataSourceV2Relation and relying on the relation to correctly set the
options passed to readers and writers minimizes the number of places that
conversion needs to happen.
rb
--
Ryan Blue
Software Engineer
and easier maintenance.
rb
--
Ryan Blue
Software Engineer
Netflix
t 9:10 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:
> +1 hangout
>
> --
> *From:* Xiao Li <gatorsm...@gmail.com>
> *Sent:* Wednesday, January 31, 2018 10:46:26 PM
> *To:* Ryan Blue
> *Cc:* Reynold Xin; dev; Wenchen Fan; Russell
; by implementing some new sources or porting an existing source over.
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ere we can improve. It is just far
easier to get a branch committed as-is than to adhere to these guidelines,
but these are important for our releases and downstream users.
Thanks for reading,
rb
--
Ryan Blue
Software Engineer
Netflix
f shading.
>>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>>> I am not sure if that causes problems for apps.
>>>
>>> Normally I'd avoid any major-version change in a minor release. This one
>>> looked potentially entirely internal.
>>> I think if there are any doubts, we can leave it for Spark 3. There was
>>> a bug report that needed a fix from Kryo 4, but it might be minor after all.
>>>
>>>>
>>>>
--
Ryan Blue
Software Engineer
Netflix
Great. What's the JIRA issue?
On Mon, Dec 11, 2017 at 8:12 PM, Jason White <jason.wh...@shopify.com>
wrote:
> Yes, the fix has been merged and should make it into the 2.3 release.
>
> On Mon, Dec 11, 2017, 5:50 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Is anyon
see
> any commands with metrics, but I could be missing something.
>
>
>
t; I also suggested it because this behavior appears to be the default for
> ASF projects. It wasn't clear why Spark was set up differently.
>
>
> On Thu, Oct 5, 2017 at 5:00 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> While I have also felt this frustration and understand the
h...@gmail.com>
>>> wrote:
>>>
>>>> It can stop reopening, but new JIRA issues with duplicate content will
>>>> be created intentionally instead.
>>>>
>>>> Is that policy (privileged reopening) used in other Apache communities
>>>> for that purpose?
>>>>
>>>>
>>>> On Wed, Oct 4, 2017 at 7:06 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> We have this problem occasionally, where a disgruntled user
>>>>> continually reopens an issue after it's closed.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-21999
>>>>>
>>>>> (Feel free to comment on this one if anyone disagrees)
>>>>>
>>>>> Regardless of that particular JIRA, I'd like to disable the Closed ->
>>>>> Reopened transition for non-committers: https://issues.apache.org/jira
>>>>> /browse/INFRA-15221
>>>>>
>>>>>
>>>>
>>>
>
--
Ryan Blue
Software Engineer
Netflix
b.com/apache/spark-website/pull/66*
>> as
>> I progress), however the chances of a mistake are higher with any change
>> like this. If there is something you normally take for granted as correct when
>> checking a release, please double check this time :)
>>
>> *Should I be committing code to branch-2.1?*
>>
>> Thanks for asking! Please treat this stage in the RC process as "code
>> freeze", so bug fixes only. If you're uncertain whether something should be
>> backported, please reach out. If you do commit to branch-2.1 please tag your
>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move
>> the 2.1.3 fixed into 2.1.2 as appropriate.
>>
>> *What happened to RC3?*
>>
>> Some R+zinc interactions kept it from getting out the door.
>> --
>> Twitter: *https://twitter.com/holdenkarau*
>>
>>
>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
t don't have
> metastore.
>
> Personally I prefer proposal 3, because it's not blocked by catalog
> federation, so that we can develop it incrementally. And it makes the
> catalog support optional, so that simple data sources without metastore can
> also implement data source v2.
>
&
n implement some dirty features via options. e.g. file
>>>> format data sources can take partitioning/bucketing from options, data
>>>> source with metastore can use a special flag in options to indicate a
>>>> create table command(without writing data).
>>>>
>>>
>>> I can see how this would make changes smaller, but I don't think it is a
>>> good thing to do. If we do this, then I think we will not really accomplish
>>> what we want to with this (a clean write API).
>>>
>>>
>>>> In other words, Spark connects users to data sources with a clean
>>>> protocol that only focus on data, but this protocol has a backdoor: the
>>>> data source options. Concrete data sources are free to define how to deal
>>>> with metadata, e.g. Cassandra data source can ask users to create table at
>>>> Cassandra side first, then write data at Spark side, or ask users to
>>>> provide more details in options and do CTAS at Spark side. These can be
>>>> done via options.
>>>>
>>>> After catalog federation, hopefully only file format data sources still
>>>> use this backdoor.
>>>>
>>>
>>> Why would file format sources use a back door after catalog federation??
>>>
>>> rb
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
>>> This is the first release in a while not built on the AMPLAB Jenkins.
>>>> This is good because it means future releases can more easily be built and
>>>> signed securely (and I've been updating the documentation in
>>>> https://github.com/apache/spark-website/pull/66 as I progress),
>>>> however the chances of a mistake are higher with any change like this. If
>>>> there is something you normally take for granted as correct when checking a
>>>> release, please double check this time :)
>>>>
>>>> *Should I be committing code to branch-2.1?*
>>>>
>>>> Thanks for asking! Please treat this stage in the RC process as "code
>>>> freeze", so bug fixes only. If you're uncertain whether something should be
>>>> backported, please reach out. If you do commit to branch-2.1 please tag your
>>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>>
>>>> *Why the longer voting window?*
>>>>
>>>> Since there is a large industry big data conference this week I figured
>>>> I'd add a little bit of extra buffer time just to make sure everyone has a
>>>> chance to take a look.
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Luciano Resende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
--
Ryan Blue
Software Engineer
Netflix
free to define how to deal with
> metadata, e.g. Cassandra data source can ask users to create table at
> Cassandra side first, then write data at Spark side, or ask users to
> provide more details in options and do CTAS at Spark side. These can be
> done via options.
>
> After catalog fe
nless you generate the sources manually)
>
--
Ryan Blue
Software Engineer
Netflix
.
>
> The same thing applies to Hadoop FS data sources, we need to pass metadata
> to the writer anyway.
>
>
>
> On Tue, Sep 26, 2017 at 1:08 AM, Ryan Blue <rb...@netflix.com> wrote:
>
>> However, without catalog federation, Spark doesn’t have an API to ask an
>&
to take
> this information and create the table, or throw an exception if it
> doesn't match the already-configured table.
>
>
> On Fri, Sep 22, 2017 at 9:35 AM, Ryan Blue <rb...@netflix.com> wrote:
>
>> > input data requirement
>>
>> Clustering
lt-in file-based data
> sources, and I don't think we will extend it in the near future, so I propose
> to mark them as out of scope.
>
>
> Any comments are welcome!
> Thanks,
> Wenchen
>
--
Ryan Blue
Software Engineer
Netflix
h a list of predefined options, to save
>> users from typing these options again and again for each query.
>> If that's all, then everything is good, we don't need to add more
>> interfaces to Data Source V2. However, data source tables provide special
>> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which requires data
>> sources to have some extra ability.
>> Currently these special operators only work for built-in file-based data
>> sources, and I don't think we will extend it in the near future, so I propose
>> to mark them as out of scope.
>>
>>
>> Any comments are welcome!
>> Thanks,
>> Wenchen
>>
>
>
--
Ryan Blue
Software Engineer
Netflix
7 PM, Marcelo Vanzin <van...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> +1 to this. There should be a script in the Spark repo that has all
>>>>>> the logic needed for a release. That script should take the RM's key
>>>&g
able parameters, right now
>>>>> depends on Josh Rosen, as there are some scripts which generate the jobs
>>>>> which aren't public. I've done temporary fixes in the past with the Python
>>>>> packaging but my understanding is that in the medium term it requires
>
he thread Sean Owen made.
>
> On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> I'm not familiar with the release procedure, can you send a link to this
>> Jenkins job? Can anyone run this job, or is it limited to committers?
>>
>&g
> That's a good question. I built the release candidate; however, the Jenkins
>> scripts don't take a parameter for configuring who signs them; rather, they
>> always sign them with Patrick's key. You can see this from previous
>> releases which were managed by other folk
> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue &
ere is something which is a regression from 2.1.1 that has not been
>>> correctly targeted please ping a committer to help target the issue (you
>>> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.1.2%20OR%20affectedVersion%20%3D%202.1.1)>
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.2>
>>> ?
>>>
>>> At the time of writing, there is one in-progress major issue
>>> SPARK-21985 <https://issues.apache.org/jira/browse/SPARK-21985>, I
>>> believe Andrew Ray & HyukjinKwon are looking into this one.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
--
Ryan Blue
Software Engineer
Netflix
the correct one in Spark, but
> Parquet has been the de facto standard in Spark also. (I'm not comparing this
> with the other DBMS.)
>
> I'm wondering which way we need to go or want to go in Spark?
>
> Bests,
> Dongjoon.
>
--
Ryan Blue
Software Engineer
Netflix
tial credential to
> upload artifacts.
>
>
> On Thu, Sep 7, 2017 at 11:59 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after
>> that) if people are ok with a committer / me running the release process
>> rather than a full PMC member.
>>
>
--
Ryan Blue
Software Engineer
Netflix
ur vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>>
>>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
>
--
Ryan Blue
Software Engineer
Netflix
I do appreciate your feedback/comments on the prototype; let's keep
> the discussion there. In the meanwhile, let's have more discussion on the
> overall framework, and drive this project together.
>
> Wenchen
>
>
>
> On Thu, Aug 31, 2017 at 6:22 AM, Ryan Blue <rb...@netflix.com&
future without breaking any APIs. I'd rather we ship something useful
> that might not be the most comprehensive set, than debating about every
> single feature we should add and then creating something super complicated
> that has unclear value.
>
>
>
> On Wed, Aug 3
tially the same as I have to do while
>>>>>>> actually generating my RDD (essentially I have to generate my
>>>>>>> partitions),
>>>>>>> so I end up doing some weird caching work.
>>>>>>>
>>>>>>> This V2 API proposal has the same issues, but perhaps more so. In
>>>>>>> PrunedFilteredScan, there is essentially one degree of freedom for
>>>>>>> pruning
>>>>>>> (filters), so you just have to implement caching between
>>>>>>> unhandledFilters
>>>>>>> and buildScan. However, here we have many degrees of freedom; sorts,
>>>>>>> individual filters, clustering, sampling, maybe aggregations eventually
>>>>>>> -
>>>>>>> and these operations are not all commutative, and computing my support
>>>>>>> one-by-one can easily end up being more expensive than computing all in
>>>>>>> one
>>>>>>> go.
>>>>>>>
>>>>>>> For some trivial examples:
>>>>>>>
>>>>>>> - After filtering, I might be sorted, whilst before filtering I
>>>>>>> might not be.
>>>>>>>
>>>>>>> - Filtering with certain filters might affect my ability to push
>>>>>>> down others.
>>>>>>>
>>>>>>> - Filtering with aggregations (as mooted) might not be possible to
>>>>>>> push down.
>>>>>>>
>>>>>>> And with the API as currently mooted, I need to be able to go back
>>>>>>> and change my results because they might change later.
>>>>>>>
>>>>>>> Really what would be good here is to pass all of the filters and
>>>>>>> sorts etc all at once, and then I return the parts I can’t handle.
>>>>>>>
>>>>>>> I’d prefer in general that this be implemented by passing some kind
>>>>>>> of query plan to the datasource which enables this kind of replacement.
>>>>>>> Explicitly don’t want to give the whole query plan - that sounds
>>>>>>> painful -
>>>>>>> would prefer we push down only the parts of the query plan we deem to be
>>>>>>> stable. With the mix-in approach, I don’t think we can guarantee the
>>>>>>> properties we want without a two-phase thing - I’d really love to be
>>>>>>> able
>>>>>>> to just define a straightforward union type which is our supported
>>>>>>> pushdown
>>>>>>> stuff, and then the user can transform and return it.
>>>>>>>
>>>>>>> I think this ends up being a more elegant API for consumers, and
>>>>>>> also far more intuitive.
>>>>>>>
>>>>>>> James
>>>>>>>
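James's union-type proposal above might look roughly like this sketch (the names are illustrative, not actual DataSourceV2 mix-ins): Spark offers every candidate operation at once, and the source returns whatever it cannot handle.

```scala
sealed trait Pushdown
case class FilterOp(column: String, op: String, value: Any) extends Pushdown
case class SortBy(columns: Seq[String]) extends Pushdown
case class Sample(fraction: Double) extends Pushdown

trait SupportsBatchPushdown {
  // Receives all proposed pushdowns together; returns those Spark must still apply.
  def pushdown(all: Seq[Pushdown]): Seq[Pushdown]
}

// A toy source that can evaluate equality filters and sorts, but not sampling,
// deciding its support in one pass instead of capability-by-capability.
class ToySource extends SupportsBatchPushdown {
  def pushdown(all: Seq[Pushdown]): Seq[Pushdown] = all.filterNot {
    case FilterOp(_, "=", _) => true
    case _: SortBy           => true
    case _                   => false
  }
}
```

This avoids the one-by-one negotiation problem, since the source sees all degrees of freedom before committing to any of them.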
>>>>>>> On Mon, 28 Aug 2017 at 18:00 蒋星博 <jiangxb1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 (Non-binding)
>>>>>>>>
>>>>>>>> Xiao Li <gatorsm...@gmail.com>于2017年8月28日 周一下午5:38写道:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> 2017-08-28 12:45 GMT-07:00 Cody Koeninger <c...@koeninger.org>:
>>>>>>>>>
>>>>>>>>>> Just wanted to point out that because the jira isn't labeled
>>>>>>>>>> SPIP, it
>>>>>>>>>> won't have shown up linked from
>>>>>>>>>>
>>>>>>>>>> http://spark.apache.org/improvement-proposals.html
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> > Hi all,
>>>>>>>>>> >
>>>>>>>>>> > It has been almost 2 weeks since I proposed the data source V2
>>>>>>>>>> for
>>>>>>>>>> > discussion, and we already got some feedbacks on the JIRA
>>>>>>>>>> ticket and the
>>>>>>>>>> > prototype PR, so I'd like to call for a vote.
>>>>>>>>>> >
>>>>>>>>>> > The full document of the Data Source API V2 is:
>>>>>>>>>> > https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-
>>>>>>>>>> Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>>>>> >
>>>>>>>>>> > Note that, this vote should focus on high-level
>>>>>>>>>> design/framework, not
>>>>>>>>>> > specified APIs, as we can always change/improve specified APIs
>>>>>>>>>> during
>>>>>>>>>> > development.
>>>>>>>>>> >
>>>>>>>>>> > The vote will be up for the next 72 hours. Please reply with
>>>>>>>>>> your vote:
>>>>>>>>>> >
>>>>>>>>>> > +1: Yeah, let's go forward and implement the SPIP.
>>>>>>>>>> > +0: Don't really care.
>>>>>>>>>> > -1: I don't think this is a good idea because of the following
>>>>>>>>>> technical
>>>>>>>>>> > reasons.
>>>>>>>>>> >
>>>>>>>>>> > Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>
>>
--
Ryan Blue
Software Engineer
Netflix
.
>> memoryOverhead.
>>
>> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8
>>
>> Do you think the settings below can help me overcome the above issue:
>>
>> spark.default.parallelism=1000
>> spark.sql.shuffle.partitions=1000
>>
>> Because the default max number of partitions is 1000.
>>
>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
good idea because of the following
> technical reasons.
>
> Thanks!
>
> --
> Marcelo
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ove comments is probably (maybe
> Ryan or Steve can confirm this assumption) not applicable to the Netflix
> committer uploaded by Ryan Blue, because Ryan's committer uses multipart
> upload. So either the whole file is live or nothing is; partial data will
> not be available for read. What
itely not a release blocker.
>>
>> In any event I just resolved SPARK-20507, as I don't believe any website
>> updates are required for this release anyway. That fully resolves the ML QA
>> umbrella (SPARK-20499).
>>
>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
>
>
--
Ryan Blue
Software Engineer
Netflix
cts to
> find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead. Spark
> already has to work around this for unit tests to pass.
>
>
>
> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rb...@netflix.com> wrote:
>
>> Thanks for the extra context, Frank. I a
for a stage. In that version, you probably want to set
spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs
<http://spark.apache.org/docs/latest/configuration.html> and search for
“blacklist” to see all the options.
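For example, in spark-defaults.conf (the values here are illustrative, not recommendations):

```
spark.blacklist.enabled                          true
spark.blacklist.task.maxTaskAttemptsPerExecutor  1
spark.blacklist.task.maxTaskAttemptsPerNode      2
```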
rb
On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue <rb...@netflix.c
ase
>>>>> coordinator so I understand if that's not actually faster).
>>>>>
>>>>> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>
>>>>>> I backported the fix into both branch-2.1 and branch-2.0. Thank
gt;>> https://repository.apache.org/content/repositories/orgapache
>>>>> spark-1227/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.
>>>>> 1-rc2-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>>>>
>>>>> *But my bug isn't fixed!??!*
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.1.0.
>>>>>
>>>>> *What happened to RC1?*
>>>>>
>>>>> There were issues with the release packaging and as a result it was
>>>>> skipped.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Cell : 425-233-8271
>>>>> Twitter: https://twitter.com/holdenkarau
>
>
--
Ryan Blue
Software Engineer
Netflix
mand$$anonfun$run$1.apply$mcV$sp(
> InsertIntoHadoopFsRelationCommand.scala:149)
> at org.apache.spark.sql.execution.datasources.
> InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(
> InsertIntoHadoopFsRelationCommand.scala:115)
>
> {logs}
>
>
>
--
Ryan Blue
Software Engineer
Netflix
dp.s3.S3PartitionedOutputCommitter not
> > org.apache.parquet.hadoop.ParquetOutputCommitter
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2221)
> >... 28 more
> >
> > can you please point out my mistake.
> >
> > If possible can you give a working example of saving a dataframe as a
> > parquet file in s3.
> >
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Output-Committers-
> for-S3-tp21033p21246.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ess improvement, it's like everything
>> depends
>> > only on the shepherd.
>> >
>> > Also want to add the point that a SPIP should be time-bound with a
>> > defined SLA,
>> > else it defeats the purpose.
>> >
>> >
>> > Regards,
On Tue, Feb 21, 2017 at 6:15 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> On 21 Feb 2017, at 01:00, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > You'd have to encode the task ID in the output file name to identify files
> > to roll back in the even
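A minimal sketch of that idea of encoding the task ID in the file name (the `part-<task>-attempt-<n>` naming convention here is hypothetical, purely to illustrate recovering task and attempt IDs from file names for cleanup):

```python
import re

# Hypothetical naming convention: part-<taskId>-attempt-<attemptNumber>.parquet
FILE_PATTERN = re.compile(r"part-(\d+)-attempt-(\d+)\.parquet$")

def files_to_roll_back(files, committed):
    """Return files whose (task, attempt) pair was never committed, so a
    cleanup pass after a failed job knows exactly which files to delete."""
    doomed = []
    for name in files:
        m = FILE_PATTERN.search(name)
        if m and (int(m.group(1)), int(m.group(2))) not in committed:
            doomed.append(name)
    return doomed

files = ["part-0-attempt-0.parquet", "part-1-attempt-0.parquet",
         "part-1-attempt-1.parquet"]
committed = {(0, 0), (1, 1)}  # task 1's first attempt failed
assert files_to_roll_back(files, committed) == ["part-1-attempt-0.parquet"]
```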
>
> --
> View this message in context: RE: Will .count() always trigger an
> evaluation of each row?
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Will-count-always-trigger-an-evaluation-of-each-row-tp21018p21027.html>
>
--
Ryan Blue
Software Engineer
Netflix
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Output-Committers-
> for-S3-tp21033.html
>
> to ensure our quality. Spark is not application software. It is
> >>>>>> infrastructure software that is being used by many, many companies.
> We have
> >>>>>> to be very careful in the design and implementation, especially
> >>>>>> adding/changing the external APIs.
> >>>>>>
> >>>>>>
> >>>>>> When I developed the Mainframe infrastructure/middleware software in
> >>>>>> the past 6 years, I were involved in the discussions with
> external/internal
> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
> the
> >>>>>> customers are feeling frustrated when we are unable to deliver them
> on time
> >>>>>> due to the resource limits and others. Even if they paid us
> billions, we
> >>>>>> still need to do it phase by phase or sometimes they have to accept
> the
> >>>>>> workarounds. That is the reality everyone has to face, I think.
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> Xiao Li
> >>>>>>>
> >>>>>>>
> >>
> >
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
the SPIP is
>>>>>>> recorded (rejected, accepted, etc.), and advising about the technical
>>>>>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>>>>>> contribute to it, but rather makes sure it stands a chance of
.@spark.apache.org and my mail was bouncing each
> time so Sean Owen suggested to mail dev.(https://issues.apache.
> org/jira/browse/SPARK-19546). Please give solution to above ticket also
> if possible.
>
> Thanks
>
> --
> Shivam Sharma
>
--
Ryan Blue
Software Engineer
Netflix
progress"
> java.lang.OutOfMemoryError: Java heap space at
> java.util.Arrays.copyOfRange(Arrays.java:3664) at
> java.lang.String.<init>(String.java:207) at
> java.lang.StringBuilder.toString(StringBuilder.java:407) at
> scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:430)
> at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:101)
> at
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71)
> at
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:55)
> at java.util.TimerThread.mainLoop(Timer.java:555) at
> java.util.TimerThread.run(Timer.java:505)
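The trace above shows the allocation happening inside ConsoleProgressBar's refresh loop. One workaround sometimes used (hedged: `spark.ui.showConsoleProgress` is a documented setting, but whether it addresses the root cause of this particular OOM is an assumption) is to disable the console progress bar entirely:

```shell
spark-submit --conf spark.ui.showConsoleProgress=false my-app.jar
```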
>
>
--
Ryan Blue
Software Engineer
Netflix
finished? I tried doing it in
> the JobProgressListener but it does not seem to work in a cluster. The
> event is not triggered in the worker.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
--
Ryan Blue
Software Engineer
Netflix
1701.mbox/%3CCAO4re1mnWJ3%3Di0NpUmPU%2BwD8G%3DsG_%2BAA2PsFBzZv%3DwrUR1529g%40mail.gmail.com%3E>
on the Parquet dev list. If you're interested in reviewing what goes into
1.8.2 or have suggestions, please follow that thread on the Parquet list.
Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
astore, can you tell me which
> version is more compatible with Spark 2.0.2 ?
>
> THanks
>
--
Ryan Blue
Software Engineer
Netflix
>
>
--
Ryan Blue
Software Engineer
Netflix
>
>> > >> >
>> > >>
>> > org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.
>> apply(basicOperators.scala:219)
>> > >> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> > >> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> > >> >
>> > >> >
>> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>
>> > >> >
>> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> > >> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> > >> >
>> > >> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>> > >> > org.apache.spark.scheduler.Task.run(Task.scala:54)
>> > >> >
>> > >> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>>
>> > >> >
>> > >> >
>> > >>
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> > >> >
>> > >> >
>> > >>
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> > >> > java.lang.Thread.run(Thread.java:722)
>> > >> >
>> > >> >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>
> --
> View this message in context: Re: OutOfMemoryError on parquet
> SnappyDecompressor
> <http://apache-spark-developers-list.1001551.n3.nabble.com/OutOfMemoryError-on-parquet-SnappyDecompressor-tp8517p19965.html>
>
--
Ryan Blue
Software Engineer
Netflix
>>>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>>
>>>>
>>>> Q: How can I help test this release?
>>>> A: If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions from 2.0.1.
>>>>
>>>> Q: What justifies a -1 vote for this release?
>>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>>> present in 2.0.1, missing features, or bugs related to new features will
>>>> not necessarily block this release.
>>>>
>>>> Q: What fix version should I use for patches merging into branch-2.0
>>>> from now on?
>>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 224-436-0783
>>>>
>>>> IT Architect / Lead Consultant
>>>> Greater Chicago
>>>>
>>>
>>
>
--
Ryan Blue
Software Engineer
Netflix
> >>> >>> >>> batch...We
> >>> >>> >>> (and
> >>> >>> >>> I am sure many others) are pushing spark as an engine for
> stream
> >>> >>> >>> and
> >>> >>> >>> query
> >>> >>> >>> processing. We need to make it a state-of-the-art engine for
> >>> >>> >>> high
> >>> >>> >>> speed
> >>> >>> >>> streaming data and user queries as well !
> >>> >>> >>>
> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >>> >>> >>> <tomasz.gaw...@outlook.com>
> >>> >>> >>> wrote:
> >>> >>> >>>>
> >>> >>> >>>> Hi everyone,
> >>> >>> >>>>
> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may
> >>> >>> >>>> help a
> >>> >>> >>>> little bit. :) Many technical and organizational topics were
> >>> >>> >>>> mentioned,
> >>> >>> >>>> but I want to focus on these negative posts about Spark and
> >>> >>> >>>> about
> >>> >>> >>>> "haters"
> >>> >>> >>>>
> >>> >>> >>>> I really like Spark. Ease of use, speed, very good community -
> >>> >>> >>>> it's everything here. But every project has to "fight" on the
> >>> >>> >>>> "framework market" to still be no. 1. I'm following many Spark
> >>> >>> >>>> and Big Data communities; maybe my mail will inspire someone :)
> >>> >>> >>>>
> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time
> to
> >>> >>> >>>> join
> >>> >>> >>>> contributing to Spark) have done an excellent job. So why are some
> >>> >>> >>>> people
> >>> >>> >>>> saying that Flink (or other framework) is better, like it was
> >>> >>> >>>> posted
> >>> >>> >>>> in
> >>> >>> >>>> this mailing list? No, not because that framework is better in
> >>> >>> >>>> all
> >>> >>> >>>> cases. In my opinion, many of these discussions were started
> >>> >>> >>>> after
> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
> "Flink
> >>> >>> >>>> vs
> >>> >>> >>>> "
> >>> >>> >>>> posts, almost every post is "won" by Flink. Answers are
> >>> >>> >>>> sometimes
> >>> >>> >>>> saying nothing about other frameworks, Flink's users (often
> >>> >>> >>>> PMC's)
> >>> >>> >>>> are
> >>> >>> >>>> just posting same information about real-time streaming, about
> >>> >>> >>>> delta
> >>> >>> >>>> iterations, etc. It looks smart and very often it is marked as
> >>> >>> >>>> the answer,
> >>> >>> >>>> even if - in my opinion - not all of the truth was told.
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to
> >>> >>> >>>> perform
> >>> >>> >>>> huge
> >>> >>> >>>> performance test. Maybe some company, that supports Spark
> >>> >>> >>>> (Databricks,
> >>> >>> >>>> Cloudera? - just saying you're most visible in community :) )
> >>> >>> >>>> could
> >>> >>> >>>> perform performance test of:
> >>> >>> >>>>
> >>> >>> >>>> - streaming engine - probably Spark will loose because of
> >>> >>> >>>> mini-batch
> >>> >>> >>>> model, however currently the difference should be much lower
> >>> >>> >>>> that in
> >>> >>> >>>> previous versions
> >>> >>> >>>>
> >>> >>> >>>> - Machine Learning models
> >>> >>> >>>>
> >>> >>> >>>> - batch jobs
> >>> >>> >>>>
> >>> >>> >>>> - Graph jobs
> >>> >>> >>>>
> >>> >>> >>>> - SQL queries
> >>> >>> >>>>
> >>> >>> >>>> People will see that Spark is evolving and is also a modern
> >>> >>> >>>> framework,
> >>> >>> >>>> because after reading posts mentioned above people may think
> "it
> >>> >>> >>>> is
> >>> >>> >>>> outdated, future is in framework X".
> >>> >>> >>>>
> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
> >>> >>> >>>> Structured
> >>> >>> >>>> Streaming beats every other framework in terms of ease of use
> >>> >>> >>>> and
> >>> >>> >>>> reliability. Performance tests, done in various environments
> (in
> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
> 20-node
> >>> >>> >>>> cluster), could be also very good marketing stuff to say "hey,
> >>> >>> >>>> you're
> >>> >>> >>>> telling that you're better, but Spark is still faster and is
> >>> >>> >>>> still
> >>> >>> >>>> getting even faster!". This would be based on facts (just
> >>> >>> >>>> numbers),
> >>> >>> >>>> not opinions. It would be good for companies, for marketing
> >>> >>> >>>> purposes
> >>> >>> >>>> and
> >>> >>> >>>> for every Spark developer
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> Second: real-time streaming. I've written some time ago about
> >>> >>> >>>> real-time
> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work
> >>> >>> >>>> should be
> >>> >>> >>>> done to make SSS more low-latency, but I think it's possible.
> >>> >>> >>>> Maybe
> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
> Akka?
> >>> >>> >>>> I
> >>> >>> >>>> don't
> >>> >>> >>>> know yet, it is good topic for SIP. However I think that Spark
> >>> >>> >>>> should
> >>> >>> >>>> have real-time streaming support. Currently I see many
> >>> >>> >>>> posts/comments
> >>> >>> >>>> that "Spark has too high latency". Spark Streaming is doing a
> >>> >>> >>>> very good job with micro-batches, however I think it is possible
> >>> >>> >>>> to add
> >>> >>> >>>> also
> >>> >>> >>>> more
> >>> >>> >>>> real-time processing.
> >>> >>> >>>>
> >>> >>> >>>> Other people said much more and I agree with proposal of SIP.
> >>> >>> >>>> I'm
> >>> >>> >>>> also
> >>> >>> >>>> happy that PMC's are not saying that they will not listen to
> >>> >>> >>>> users,
> >>> >>> >>>> but
> >>> >>> >>>> they really want to make Spark better for every user.
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> What do you think about these two topics? Especially I'm
> looking
> >>> >>> >>>> at
> >>> >>> >>>> Cody
> >>> >>> >>>> (who has started this topic) and PMCs :)
> >>> >>> >>>>
> >>> >>> >>>> Pozdrawiam / Best regards,
> >>> >>> >>>>
> >>> >>> >>>> Tomasz
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>>
> >>> >>
> >>> >
> >>> >
> >
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ttps://github.com/apache/spark/pull/15538 needs to make
> it into 2.1. The logging output issue is really bad. I would probably call
> it a blocker.
>
> Michael
>
>
> On Nov 1, 2016, at 1:22 PM, Ryan Blue <rb...@netflix.com> wrote:
>
> I can when I'm finished with a coup
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> 1.9.0 includes some fixes intended specifically for Spark:
>>
>> * PARQUET-389: Evaluates push-down predicates for missing columns as
>> though they are null. This is t
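The PARQUET-389 behavior mentioned above (evaluating push-down predicates on columns that are missing from a file as though the column were all null) can be illustrated with a toy model of row-group filtering; this is not Parquet's actual evaluator, and the predicate shapes are made up:

```python
def can_drop_row_group(predicate, columns_in_file):
    """Toy push-down filter. predicate is ('eq', col, value) or ('is_null', col).
    A column absent from the file is treated as entirely null (the PARQUET-389
    behavior for missing columns), rather than raising an error."""
    op, col = predicate[0], predicate[1]
    if col not in columns_in_file:
        if op == "eq":
            return True    # null == value is never true: no row can match
        if op == "is_null":
            return False   # every (missing) value is null: rows match, keep
    return False  # column present: would need real statistics to decide

file_columns = {"id", "ts"}  # 'category' was added to the schema later
assert can_drop_row_group(("eq", "category", "news"), file_columns)
assert not can_drop_row_group(("is_null", "category"), file_columns)
assert not can_drop_row_group(("eq", "id", 7), file_columns)
```

This matters for schema evolution: files written before a column was added can still be filtered correctly instead of failing.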
ent Proposals exactly,
> >>> > but you need a documented process with a clear outcome (e.g. a vote).
> >>> > Passing around google docs after an implementation has largely been
> >>> > decided on doesn't cut it.
> >>> >
> >>> > - All technical communication needs to be public.
> >>> > Things getting decided in private chat, or when 1/3 of the committers
> >>> > work for the same company and can just talk to each other...
> >>> > Yes, it's convenient, but it's ultimately detrimental to the health
> of
> >>> > the project.
> >>> > The way structured streaming has played out has shown that there are
> >>> > significant technical blind spots (myself included).
> >>> > One way to address that is to get the people who have domain
> knowledge
> >>> > involved, and listen to them.
> >>> >
> >>> > - We need more committers, and more committer diversity.
> >>> > Per committer there are, what, more than 20 contributors and 10 new
> >>> > jira tickets a month? It's too much.
> >>> > There are people (I am _not_ referring to myself) who have been
> around
> >>> > for years, contributed thousands of lines of code, helped educate the
> >>> > public around Spark... and yet are never going to be voted in.
> >>> >
> >>> > - We need a clear process for managing volunteer work.
> >>> > Too many tickets sit around unowned, unclosed, uncertain.
> >>> > If someone proposed something and it isn't up to snuff, tell them and
> >>> > close it. It may be blunt, but it's clearer than "silent no".
> >>> > If someone wants to work on something, let them own the ticket and
> set
> >>> > a deadline. If they don't meet it, close it or reassign it.
> >>> >
> >>> > This is not me putting on an Apache Bureaucracy hat. This is me
> >>> > saying, as a fellow hacker and loyal dissenter, something is wrong
> >>> > with the culture and process.
> >>> >
> >>> > Please, let's change it.
> >>> >
> >>> >
> >>> >
> >>
> >>
> >
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
se mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>
--
Ryan Blue
Software Engineer
Netflix
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>
>
--
Ryan Blue
Software Engineer
Netflix
Are these changes that the Hive community has rejected? I don't see a
compelling reason to have a long-term Spark fork of Hive.
rb
On Sat, Oct 15, 2016 at 5:27 AM, Steve Loughran <ste...@hortonworks.com>
wrote:
>
> On 15 Oct 2016, at 01:28, Ryan Blue <rb...@netflix.com
.
> mbox/%3ca0aa8b38-deee-476a-93ff-92fead06e...@hortonworks.com%3E]
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ther than committers can cast a meaningful vote, that's the
> >>>> reality. Beyond that, if people think it's more open to allow formal
> >>>> proposals from anyone, I'm not necessarily against it, but my main
> >>>> question would be thi
> >>> >>>>>>>> type of
> >>> >>>>>>>> JIRA called a SIP and have a link to a filter that shows all
> >>> >>>>>>>> such
> >>> >>>>>>>> JIRAs from
> >>> >>>>>>>> http://spark.apache.org. I also like the idea of SIP and
> design
> >>> >>>>>>>> doc
> >>> >>>>>>>> templates (in fact many projects have them).
> >>> >>>>>>>>
> >>> >>>>>>>> Matei
> >>> >>>>>>>>
> >>> >>>>>>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]>
> >>> >>>>>>>> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>> I called Cody last night and talked about some of the topics
> in
> >>> >>>>>>>> his
> >>> >>>>>>>> email.
> >>> >>>>>>>> It became clear to me Cody genuinely cares about the project.
> >>> >>>>>>>>
> >>> >>>>>>>> Some of the frustrations come from the success of the project
> >>> >>>>>>>> itself
> >>> >>>>>>>> becoming very "hot", and it is difficult to get clarity from
> >>> >>>>>>>> people
> >>> >>>>>>>> who
> >>> >>>>>>>> don't dedicate all their time to Spark. In fact, it is in some
> >>> >>>>>>>> ways
> >>> >>>>>>>> similar
> >>> >>>>>>>> to scaling an engineering team in a successful startup: old
> >>> >>>>>>>> processes that
> >>> >>>>>>>> worked well might not work so well when it gets to a certain
> >>> >>>>>>>> size,
> >>> >>>>>>>> cultures
> >>> >>>>>>>> can get diluted, building culture vs building process, etc.
> >>> >>>>>>>>
> >>> >>>>>>>> I also really like to have a more visible process for larger
> >>> >>>>>>>> changes,
> >>> >>>>>>>> especially major user facing API changes. Historically we
> upload
> >>> >>>>>>>> design docs
> >>> >>>>>>>> for major changes, but it is not always consistent and
> difficult
> >>> >>>>>>>> to
> >>> >>>>>>>> quality
> >>> >>>>>>>> of the docs, due to the volunteering nature of the
> organization.
> >>> >>>>>>>>
> >>> >>>>>>>> Some of the more concrete ideas we discussed focus on
> building a
> >>> >>>>>>>> culture
> >>> >>>>>>>> to improve clarity:
> >>> >>>>>>>>
> >>> >>>>>>>> - Process: Large changes should have design docs posted on
> JIRA.
> >>> >>>>>>>> One
> >>> >>>>>>>> thing
> >>> >>>>>>>> Cody and I didn't discuss but an idea that just came to me is
> we
> >>> >>>>>>>> should
> >>> >>>>>>>> create a design doc template for the project and ask everybody
> >>> >>>>>>>> to
> >>> >>>>>>>> follow.
> >>> >>>>>>>> The design doc template should also explicitly list goals and
> >>> >>>>>>>> non-goals, to
> >>> >>>>>>>> make design doc more consistent.
> >>> >>>>>>>>
> >>> >>>>>>>> - Process: Email dev@ to solicit feedback. We have some this
> >>> >>>>>>>> with
> >>> >>>>>>>> some
> >>> >>>>>>>> changes, but again very inconsistent. Just posting something
> on
> >>> >>>>>>>> JIRA
> >>> >>>>>>>> isn't
> >>> >>>>>>>> sufficient, because there are simply too many JIRAs and the
> >>> >>>>>>>> signal
> >>> >>>>>>>> get lost
> >>> >>>>>>>> in the noise. While this is generally impossible to enforce
> >>> >>>>>>>> because
> >>> >>>>>>>> we can't
> >>> >>>>>>>> force all volunteers to conform to a process (or they might
> not
> >>> >>>>>>>> even
> >>> >>>>>>>> be
> >>> >>>>>>>> aware of this), those who are more familiar with the project
> >>> >>>>>>>> can
> >>> >>>>>>>> help by
> >>> >>>>>>>> emailing the dev@ when they see something that hasn't been.
> >>> >>>>>>>>
> >>> >>>>>>>> - Culture: The design doc author(s) should be open to
> feedback.
> >>> >>>>>>>> A
> >>> >>>>>>>> design
> >>> >>>>>>>> doc should serve as the base for discussion and is by no means
> >>> >>>>>>>> the
> >>> >>>>>>>> final
> >>> >>>>>>>> design. Of course, this does not mean the author has to accept
> >>> >>>>>>>> every
> >>> >>>>>>>> feedback. They should also be comfortable accepting /
> rejecting
> >>> >>>>>>>> ideas on
> >>> >>>>>>>> technical grounds.
> >>> >>>>>>>>
> >>> >>>>>>>> - Process / Culture: For major ongoing projects, it can be
> >>> >>>>>>>> useful
> >>> >>>>>>>> to
> >>> >>>>>>>> have
> >>> >>>>>>>> some monthly Google hangouts that are open to the world. I am
> >>> >>>>>>>> actually not
> >>> >>>>>>>> sure how well this will work, because of the volunteering
> nature
> >>> >>>>>>>> and
> >>> >>>>>>>> we need
> >>> >>>>>>>> to adjust for timezones for people across the globe, but it
> >>> >>>>>>>> seems
> >>> >>>>>>>> worth
> >>> >>>>>>>> trying.
> >>> >>>>>>>>
> >>> >>>>>>>> - Culture: Contributors (including committers) should be more
> >>> >>>>>>>> direct
> >>> >>>>>>>> in
> >>> >>>>>>>> setting expectations, including whether they are working on a
> >>> >>>>>>>> specific
> >>> >>>>>>>> issue, whether they will be working on a specific issue, and
> >>> >>>>>>>> whether
> >>> >>>>>>>> an
> >>> >>>>>>>> issue or pr or jira should be rejected. Most people I know in
> >>> >>>>>>>> this
> >>> >>>>>>>> community
> >>> >>>>>>>> are nice and don't enjoy telling other people no, but it is
> >>> >>>>>>>> often
> >>> >>>>>>>> more
> >>> >>>>>>>> annoying to a contributor to not know anything than getting a
> >>> >>>>>>>> no.
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia
> >>> >>>>>>>> <[hidden email]>
> >>> >>>>>>>> wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> Love the idea of a more visible "Spark Improvement Proposal"
> >>> >>>>>>>>> process that
> >>> >>>>>>>>> solicits user input on new APIs. For what it's worth, I don't
> >>> >>>>>>>>> think
> >>> >>>>>>>>> committers are trying to minimize their own work -- every
> >>> >>>>>>>>> committer
> >>> >>>>>>>>> cares
> >>> >>>>>>>>> about making the software useful for users. However, it is
> >>> >>>>>>>>> always
> >>> >>>>>>>>> hard to
> >>> >>>>>>>>> get user input and so it helps to have this kind of process.
> >>> >>>>>>>>> I've
> >>> >>>>>>>>> certainly
> >>> >>>>>>>>> looked at the *IPs a lot in other software I use just to see
> >>> >>>>>>>>> the
> >>> >>>>>>>>> biggest
> >>> >>>>>>>>> things on the roadmap.
> >>> >>>>>>>>>
> >>> >>>>>>>>> When you're talking about "changing interfaces", are you
> >>> >>>>>>>>> talking
> >>> >>>>>>>>> about
> >>> >>>>>>>>> public or internal APIs? I do think many people hate changing
> >>> >>>>>>>>> public APIs
> >>> >>>>>>>>> and I actually think that's for the best of the project.
> That's
> >>> >>>>>>>>> a
> >>> >>>>>>>>> technical
> >>> >>>>>>>>> debate, but basically, the worst thing when you're using a
> >>> >>>>>>>>> piece
> >>> >>>>>>>>> of
> >>> >>>>>>>>> software
> >>> >>>>>>>>> is that the developers constantly ask you to rewrite your app
> >>> >>>>>>>>> to
> >>> >>>>>>>>> update to a
> >>> >>>>>>>>> new version (and thus benefit from bug fixes, etc). Cue
> anyone
> >>> >>>>>>>>> who's used
> >>> >>>>>>>>> Protobuf, or Guava. The "let's get everyone to change their
> >>> >>>>>>>>> code
> >>> >>>>>>>>> this
> >>> >>>>>>>>> release" model works well within a single large company, but
> >>> >>>>>>>>> doesn't work
> >>> >>>>>>>>> well for a community, which is why nearly all *very* widely
> >>> >>>>>>>>> used
> >>> >>>>>>>>> programming
> >>> >>>>>>>>> interfaces (I'm talking things like Java standard library,
> >>> >>>>>>>>> Windows
> >>> >>>>>>>>> API, etc)
> >>> >>>>>>>>> almost *never* break backwards compatibility. All this is
> done
> >>> >>>>>>>>> within reason
> >>> >>>>>>>>> though, e.g. we do change things in major releases (2.x, 3.x,
> >>> >>>>>>>>> etc).
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> --
> >>> >>>>> Stavros Kontopoulos
> >>> >>>>> Senior Software Engineer
> >>> >>>>> Lightbend, Inc.
> >>> >>>>> p: +30 6977967274
> >>> >>>>> e: [hidden email]
> >>> >>>>>
> >>> >>>>>
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >>
> >>>
> >>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > View this message in context: RE: Spark Improvement Proposals
>
>
>
--
Ryan Blue
Software Engineer
Netflix