+1
On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale
wrote:
> +1
>
> On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun
> wrote:
>
>> FYI, there is a proposal to drop Python 3.8 because its EOL is October
>> 2024.
>>
>> https://github.com/apache/spark/pull/46228
>> [SPARK-47993][PYTHON] Drop Python 3.8
One of the problems in the past when something like this was brought up was that
the ASF couldn't officially bless venues beyond the already approved ones. So
that's something to look into.
Now of course you are welcome to run unofficial things unblessed as long as
they follow trademark
+1
On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim < kabhwan.opensou...@gmail.com >
wrote:
>
> +1 (non-binding), thanks Gengliang!
>
>
> On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang < ltn...@gmail.com > wrote:
>
>
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Structured Logging
+1
On Fri, Nov 24, 2023 at 10:19 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> +1
>
>
> Thanks,
> Dongjoon.
>
> On Fri, Nov 24, 2023 at 7:14 PM Ye Zhou < zhouye...@gmail.com > wrote:
>
>
>> +1(non-binding)
>>
>> On Fri, Nov 24, 2023 at 11:16 Mridul
Why do we need this? The reason data source APIs need it is that they will be
used by very unsophisticated end users, all the time (for each connection /
query). Shuffle is something you set up once, presumably by fairly
sophisticated admins / engineers.
On Sat, Nov 04, 2023 at 2:42
It should be the same as SQL. Otherwise it takes away a lot of potential future
optimization opportunities.
On Mon, Sep 18 2023 at 8:47 AM, Nicholas Chammas < nicholas.cham...@gmail.com >
wrote:
>
> I’ve always considered DataFrames to be logically equivalent to SQL tables
> or queries.
>
>
+1!
On Fri, Jul 7 2023 at 11:58 AM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> +1
>
>
> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao < huaxin.ga...@gmail.com > wrote:
>
>
>
>> +1
>>
>>
>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh < mich.talebza...@gmail.com > wrote:
>>
>>
Personally I'd love this, but I agree with some of the earlier comments that
this should not be Python specific (meaning I should be able to implement a
data source in Python and then make it usable across all languages Spark
supports). I think we should find a way to make this reusable beyond
+1
This is a great idea.
On Wed, Jun 21, 2023 at 8:29 AM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> I’d like to start with a +1, better Python testing tools integrated into
> the project make sense.
>
> On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu < amandastephanieliu@gmail.com >
+1
On Thu, Jan 12, 2023 at 9:46 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> +1 for the proposal (guiding only without any code change).
>
>
> Thanks,
> Dongjoon.
>
> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu < zsxw...@gmail.com > wrote:
>
>
>> +1
Spark Connect :)
(It’s work in progress)
On Mon, Dec 12 2022 at 2:29 PM, Kevin Su < pings...@gmail.com > wrote:
>
> Hey there, How can I get the same spark context in two different python
> processes?
> Let’s say I create a context in Process A, and then I want to use python
> subprocess B to
+1 super excited about this. I think it'd make Spark a lot more usable in
application development and cloud settings:
(1) Makes it easier to embed in applications with thinner client dependencies.
(2) Easier to isolate user code vs system code in the driver.
(3) Opens up the potential to upgrade
Nice! Going to order a few items myself ...
On Tue, Jun 14, 2022 at 7:54 PM, Gengliang Wang < ltn...@gmail.com > wrote:
>
> FYI now you can find the shopping information on https://spark.apache.org/community as well :)
>
>
>
> Gengliang
>
>
>
This is why RoundRobinPartitioning shouldn't be used ...
On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote:
>
> Hi Spark community,
>
> I reported a data correctness issue in https://issues.apache.org/jira/browse/SPARK-38388
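The hazard behind RoundRobinPartitioning can be sketched in plain Python (an illustrative toy, not Spark's implementation): round-robin placement depends only on the order rows arrive, so a retried task whose input arrives in a different order sends rows to different partitions than the first attempt, which can drop or duplicate rows downstream.

```python
def round_robin(rows, num_partitions):
    """Toy round-robin partitioner: placement depends only on arrival order."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

first_attempt = round_robin(["a", "b", "c", "d"], 2)   # [['a', 'c'], ['b', 'd']]
# Same rows arriving in a different order, as can happen when a task is retried:
retry = round_robin(["b", "a", "c", "d"], 2)           # [['b', 'c'], ['a', 'd']]
print(first_attempt != retry)  # True: partition contents differ across attempts
```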
tl;dr: there's no easy way to implement aggregate expressions that require
multiple passes over the data. It is simply not something that's supported, and
doing so would come at a very high cost.
Would you be OK using approximate percentile? That's relatively cheap.
On Mon, Dec 13, 2021 at 6:43 PM,
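Why the approximate version is cheap in this model: an approximate-percentile sketch is single-pass and mergeable across partitions, which is exactly the shape Spark aggregates need, whereas an exact percentile needs the fully sorted data. A toy equi-width-histogram sketch makes the point (illustrative only; Spark's approx_percentile uses a more sophisticated algorithm, and the value range is assumed known here):

```python
class HistogramQuantile:
    """Toy mergeable quantile sketch: an equi-width histogram over a known range."""

    def __init__(self, lo, hi, bins=64):
        self.lo, self.hi = lo, hi
        self.counts = [0] * bins

    def add(self, x):
        # Clamp into the last bucket so x == hi doesn't overflow the array.
        b = min(int((x - self.lo) / (self.hi - self.lo) * len(self.counts)),
                len(self.counts) - 1)
        self.counts[b] += 1

    def merge(self, other):
        # Partial aggregates from different partitions combine by adding counts.
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return self

    def quantile(self, q):
        target, seen = q * sum(self.counts), 0
        width = (self.hi - self.lo) / len(self.counts)
        for i, c in enumerate(self.counts):
            seen += c
            if seen >= target:
                return self.lo + (i + 1) * width  # upper edge of the bucket
        return self.hi

# One sketch per "partition", a single pass over each, then one merge.
part1 = HistogramQuantile(0, 1000)
part2 = HistogramQuantile(0, 1000)
for x in range(500):
    part1.add(x)
for x in range(500, 1000):
    part2.add(x)
m = part1.merge(part2).quantile(0.5)
print(m)  # within one bucket width of the true median (~499.5)
```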
Read up on Unsafe here: https://mechanical-sympathy.blogspot.com/
On Sat, Oct 16, 2021 at 12:41 AM, Rohan Bajaj < rohanbaja...@gmail.com > wrote:
>
> In 2015 Reynold Xin made improvements to Spark and it was basically moving
> some structures that were on the java heap and movin
+1
On Thu, Oct 07, 2021 at 11:54 PM, Yuming Wang < wgy...@gmail.com > wrote:
>
> +1 (non-binding).
>
>
> On Fri, Oct 8, 2021 at 1:02 PM Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
>
>
>> +1 for Apache Spark 3.2.0 RC7.
>>
>>
>> It looks good to me. I
+1. Would open up a huge persona for Spark.
On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler < cutl...@gmail.com > wrote:
>
> +1 (non-binding)
>
>
> On Fri, Mar 26, 2021 at 9:49 AM Maciej < mszymkiew...@gmail.com > wrote:
>
>
>> +1 (nonbinding)
>>
>>
>>
>> On 3/26/21 3:52 PM, Hyukjin Kwon
I don't think we should deprecate existing APIs.
Spark's own Python API is relatively stable and not difficult to support. It
has a pretty large number of users and existing code. Also pretty easy to learn
by data engineers.
The pandas API is great for data science, but isn't that great for some
+1 Correctness issues are serious!
On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan < mri...@gmail.com >
wrote:
>
> That is indeed cause for concern.
> +1 on extending the voting deadline until we finish investigation of this.
>
>
>
>
> Regards,
> Mridul
>
>
>
> On Wed, Feb 24,
Enrico - do feel free to reopen the PRs or email people directly, unless you
are told otherwise.
On Thu, Feb 18, 2021 at 9:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com
> wrote:
>
> On Thu, Feb 18, 2021 at 10:34 AM Sean Owen < sro...@gmail.com > wrote:
>
>
>>
Late +1
On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh < vii...@gmail.com > wrote:
>
>
>
> Hi devs,
>
>
>
> Thanks for all the inputs. I think overall there are positive inputs in
> Spark community about having RocksDB state store as external module. Then
> let's go forward with this
There's another thing that's not mentioned: it's primarily a problem for
Scala. Due to static typing, we need a very large number of function overloads
for the Scala version of each function, whereas in SQL/Python each function is
just one. There's a limit on how many functions we can add, and it also
Exciting & look forward to this!
(And a late +1 vote that probably won't be counted)
On Mon, Nov 09, 2020 at 2:37 PM, Allison Wang < allison.w...@databricks.com >
wrote:
>
>
>
> Thanks everyone for voting! With 11 +1s and no -1s, this vote passes.
>
>
>
> +1s:
> Mridul Muralidharan
>
Take care Holden and best of luck with everything!
On Sat, Oct 31 2020 at 10:21 AM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> Hi Folks,
>
>
> Just a heads up so folks working on decommissioning or other areas I've
> been active in don't block on me, I'm going to be out for at least a
The issue is memory overhead. Writing files creates a lot of buffers
(especially in columnar formats like Parquet/ORC). Even a few file handles and
buffers per task can easily OOM the entire process.
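A back-of-envelope sketch of that memory math (every figure below is an assumption for illustration, not a Parquet/ORC default): columnar writers buffer per column, so usage multiplies across columns, open files, and concurrent tasks.

```python
# Every figure here is an assumption for illustration, not a format default.
buffer_per_column_mb = 1     # in-memory write buffer held per column writer
num_columns = 200            # a wide table
open_writers_per_task = 5    # e.g. dynamic-partition writes keep several files open
tasks_per_executor = 8

total_mb = (buffer_per_column_mb * num_columns
            * open_writers_per_task * tasks_per_executor)
print(total_mb)  # 8000 MB of write buffers alone on a single executor
```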
On Fri, Sep 04, 2020 at 5:51 AM, XIMO GUANTER GONZALBEZ <
Welcome all!
On Tue, Jul 14, 2020 at 10:36 AM, Matei Zaharia < matei.zaha...@gmail.com >
wrote:
>
>
>
> Hi all,
>
>
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new roles! The new committers are:
>
>
>
> - Huaxin Gao
> -
+1 on doing a new patch release soon. I saw some of these issues when preparing
the 3.0 release, and some of them are very serious.
On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu > wrote:
>
>
>
> +1 Thanks Yuanjian -- I think it'll be great to have a
Thanks for doing this. I think this is a great thing to do.
But we gotta be careful with API compatibility.
On Thu, Jun 18, 2020 at 11:32 AM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> Hi Folks,
>
>
> I've started working on cleaning up the Spark code to remove references to
> slave
Hi all,
Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of
the innovations from Spark 2.x, bringing new ideas as well as continuing
long-term projects that have been in development. This release resolves more
than 3400 tickets.
We'd like to thank our contributors
com > wrote:
>
> Reynold,
>
>
> What's the plan on pushing the official release binaries and source tar?
> It would be nice to have the release artifacts now that it's available on
> maven.
>
>
> thanks,
> Tom
>
>
> On Monday, June 15, 2020, 01:52:
> Thanks,
> Dongjoon.
>
>
>
> On Tue, Jun 9, 2020 at 9:41 PM Matei Zaharia < matei.zaha...@gmail.com > wrote:
>
>
>> Congrats! Excited to see the release posted soon.
>>
>>
>>> On Jun 9, 2020,
lease at the time we cut the
> branch.
>
> On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> I understand the argument to add JDK 11 support just to extend the EOL,
>> but the other things seem kind of arbitrary
I understand the argument to add JDK 11 support just to extend the EOL, but
the other things seem kind of arbitrary and are not supported by your
arguments, especially DSv2 which is a massive change. DSv2 IIUC is not api
stable yet and will continue to evolve in the 3.x line.
Spark is designed in
I waited another day to account for the weekend. This vote passes with the
following +1 votes and no -1 votes!
I'll start the release prep later this week.
+1:
Reynold Xin (binding)
Prashant Sharma (binding)
Gengliang Wang
Sean Owen (binding)
Mridul Muralidharan (binding)
Takeshi Yamamuro
Apologies for the mistake. The vote is open till 11:59pm Pacific time on
Mon June 9th.
On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0.
>
> The vote is open until [DUE DAY] and passes if a majority
Please vote on releasing the following candidate as Apache Spark version 3.0.0.
The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are
cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this package because ...
To
Please vote on releasing the following candidate as Apache Spark version 3.0.0.
The vote is open until Thu May 21 11:59pm Pacific time and passes if a majority
+1 PMC votes are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this
The con is much more than just more effort to maintain a parallel API. It
puts the burden for all libraries and library developers to maintain a
parallel API as well. That’s one of the primary reasons we moved away from
this RDD vs JavaRDD approach in the old RDD API.
On Tue, Apr 28, 2020 at
bdi...@husky.neu.edu > wrote:
>
> Is it correct to say, the nodes in the DAG are RDDs and the edges are
> computations?
>
>
> On Thu, Apr 16, 2020 at 6:21 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> The RDD is the DAG.
>>
The RDD is the DAG.
On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi < abdi...@husky.neu.edu > wrote:
>
> Hello everyone,
>
> I am implementing a caching mechanism for analytic workloads running on
> top of Spark and I need to retrieve the Spark DAG right after it is
> generated and the DAG
The Apache Software Foundation requires voting before any release can be
published.
On Tue, Mar 31, 2020 at 11:27 PM, Stephen Coy < s...@infomedia.com.au.invalid >
wrote:
>
>
>> On 1 Apr 2020, at 5:20 pm, Sean Owen < sro...@gmail.com > wrote:
>>
>> It can be
Please vote on releasing the following candidate as Apache Spark version 3.0.0.
The vote is open until 11:59pm Pacific time Fri Apr 3 , and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this
the
>>> RCs.
>>>
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Xiao
>>>
>>>
>>>
>>> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun < dongjoon.h...@gmail.com >
bcc dev, +user
You need to print out the result. Take itself doesn't print. You only got the
results printed to the console because the Scala REPL automatically prints the
returned value from take.
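The same script-vs-REPL rule can be shown with a plain-Python stand-in for take (a hypothetical stand-in, not the Spark API):

```python
def take(n):
    """Hypothetical stand-in for DataFrame.take: returns rows, prints nothing."""
    return [("row", i) for i in range(n)]

take(2)          # in a script, this return value is silently discarded;
                 # only a REPL echoes unconsumed results back at you
print(take(2))   # prints [('row', 0), ('row', 1)]
```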
On Thu, Mar 26, 2020 at 12:15 PM, Zahid Rahman < zahidr1...@gmail.com > wrote:
>
> I am
I actually think we should start cutting RCs. We can cut RCs even with blockers.
On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> Hi, All.
>
> First of all, always "Community Over Code"!
> I wish you the best health and happiness.
>
> As we know, we are
tasource as provider for CREATE TABLE
> syntax", 2019/12/06
> > https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
You are joking when you said " informed widely and discussed in many ways
twice" right?
This thread doesn't even talk about char/varchar:
https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
(Yes it talked about changing the
n-prem.
>
>
>
> Bests,
> Dongjoon.
>
>
> On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> −User
>>
>>
>>
>> char barely showed up (honestly negligible). I was compari
y
> from the standard on this specific behavior.
>
>
> Bests,
> Dongjoon.
>
> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> BTW I'm not opposing us sticking to SQL standard (I'm in general fo
so deviate away from the standard on this
specific behavior.
On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < r...@databricks.com > wrote:
>
> I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
>
joon.h...@gmail.com ) > wrote:
>>>
>>> Hi, Reynold.
>>> (And +Michael Armbrust)
>>>
>>>
>>> If you think so, do you think it's okay that we change the return value
>>> silently? Then, I'm wondering why we reverted `TRIM`
>> 100% agree with Reynold.
>>
>>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < r...@databricks.com > wrote:
>>
>>
>>> Are
the proposed alternative to reduce the potential issue.
>
>
> Please give us your opinion since it's still PR.
>
>
> Bests,
> Dongjoon.
>
> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < r...@databricks.com > wrote:
>
>
>>
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of
both new and old users?
For old users, their old code that was working for char(3) would now stop
working.
For new users, depending on whether the underlying metastore char(3) is
either supported but different from ansi
+1
On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge < jzh...@apache.org > wrote:
>
> +1 (non-binding)
>
>
> On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer < heue...@gmail.com > wrote:
>
>
>> +1 (non-binding)
>>
>>
>> I am disappointed however that this only mentions API
It's a good discussion to have, though: should we deprecate dstream, and what do
we need to do to make that happen? From my experience working with a lot of
Spark users, I generally recommend they stay away from dstream, due to a lot of
design and architectural issues.
On Mon, Mar 02,
This is really cool. We should also be more opinionated about how we specify
time and intervals.
On Wed, Feb 12, 2020 at 3:15 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> Thank you, Wenchen.
>
>
> The new policy looks clear to me. +1 for the explicit policy.
>
>
> So, are we
Note that branch-3.0 was cut. Please focus on testing, polish, and let's get
the release out!
On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin < r...@databricks.com > wrote:
>
> Just a reminder - code freeze is coming this Fri !
>
>
>
> There can always be exce
m CleanupAliases
>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>>> aggregate
>>>>> SPARK-25531 new write APIs for data source v2
>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>>> SPARK-20845 Support specification of
If your UDF itself is very CPU intensive, it probably won't make much of a
difference, because the UDF itself will dwarf the serialization/deserialization
overhead.
If your UDF is cheap, it will help tremendously.
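The overhead being amortized can be modeled with plain pickle (an illustrative toy; Spark actually moves pandas UDF data with Arrow): serializing row by row pays per-payload framing costs N times, while one batch pays them once.

```python
import pickle

rows = [(i, float(i)) for i in range(1000)]

# Row-at-a-time: one pickle payload per row, so framing overhead is paid 1000x.
per_row_bytes = sum(len(pickle.dumps(r)) for r in rows)

# Batched, which is what Arrow-based pandas UDFs approximate: one payload total.
batch_bytes = len(pickle.dumps(rows))

print(per_row_bytes > batch_bytes)  # True: batching amortizes per-payload overhead
```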
On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote:
>
>
>
> Hi,
Thanks for writing this up.
Usually when people talk about push-based shuffle, they are motivating it
primarily to reduce the latency of short queries, by pipelining the map phase,
shuffle phase, and the reduce phase (which this design isn't going to address).
It's interesting you are
This seems reasonable!
On Tue, Jan 21, 2020 at 3:23 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> +1, I'm supporting the following proposal.
>
>
> > this mirror as the primary repo in the build, falling back to Central if
> needed.
>
>
> Thanks,
> Dongjoon.
>
>
>
> On Tue,
Introducing a new data type has high overhead, both in terms of internal
complexity and users' cognitive load. Introducing two data types would have
even higher overhead.
I looked quickly and looks like both Redshift and Snowflake, two of the most
recent SQL analytics successes, have only one
Can this perhaps exist as a utility function outside Spark?
On Tue, Jan 07, 2020 at 12:18 AM, Enrico Minack < m...@enrico.minack.dev >
wrote:
>
>
>
> Hi Devs,
>
>
>
> I'd like to get your thoughts on this Dataset feature proposal. Comparing
> datasets is a central operation when
We've pushed out 3.0 multiple times. The latest release window documented on
the website ( http://spark.apache.org/versioning-policy.html ) says we'd code
freeze and cut branch-3.0 early Dec. It looks like we are suffering a bit from
the tragedy of the commons, that nobody is pushing for
If the cost is low, why don't we just do monthly previews until we code freeze?
If it is high, maybe we should discuss and do it when there are people that
volunteer
On Sun, Dec 08, 2019 at 10:32 PM, Xiao Li < gatorsm...@gmail.com > wrote:
>
>
>
> I got many great feedbacks from the
It’s mainly due to compilation speed. The Scala compiler is known to be slow;
even javac is quite slow. We use Janino, a simpler compiler, to get faster
compilation speed at runtime.
Also, for low-level code we can’t use (due to perf concerns) any of the
edges Scala has over Java, e.g. we can’t
Does the description make sense? This is a preview release so there is no
need to retarget versions.
On Tue, Oct 29, 2019 at 7:01 PM Xingbo Jiang wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0-preview.
>
> The vote is open until November 2 PST and
Just curious - did we discuss why this shouldn't be another Apache sister
project?
On Wed, Oct 16, 2019 at 10:21 AM, Sean Owen < sro...@gmail.com > wrote:
>
>
>
> We don't all have to agree on whether to add this -- there are like 10
> people with an opinion -- and I certainly would not veto
Can we just tag master?
On Wed, Oct 16, 2019 at 12:34 AM, Wenchen Fan < cloud0...@gmail.com > wrote:
>
> Does anybody remember what we did for 2.0 preview? Personally I'd like to
> avoid cutting branch-3.0 right now, otherwise we need to merge PRs into
> two branches in the following several
te up, but I think we should at least give some
> up-to-date description on that JIRA entry.
>
> On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> No there is no separate write up internally.
>>
>> On Wed, Oct 2, 201
t;
>> Regarding the place in the optimizer rules, it's preferred to happen late
>> in the optimization, and definitely after join reorder.
>>
>>
>> Thanks,
>> Maryann
>>
>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin wrote:
>>
>>&g
Whoever created the JIRA years ago didn't describe dpp correctly, but the
linked jira in Hive was correct (which unfortunately is much more terse than
any of the patches we have in Spark
https://issues.apache.org/jira/browse/HIVE-9152 ). Henry R's description was
also correct.
On Wed, Oct 02,
A while ago we changed it so the task gets broadcasted too, so I think the two
are fairly similar.
On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati < dhruba.w...@gmail.com >
wrote:
>
> I was wondering if anyone could help with this question.
>
> On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti
would want to carefully consider whether that is the right
> decision. And in any case, we would be able to keep 2.5 and 3.0
> compatible, which is the main goal.
>
> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>
>> How wou
gt;>> problems with it.
>>>>
>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> progress towards more stable, but will have to break certain APIs) and 2.x
>>>> seems like a false premise.
>>>>
>>>>
afeRow
> was part of the original proposal.
>
>
>
> In any case, the goal for 3.0 was not to replace the use of InternalRow ,
> it was to get the majority of SQL working on top of the interface added
> after 2.4. That’s done and stable, so I think a 2.5 release with it is
ntaining compatibility
> between the 2.5 version and the 3.0 version. If we find that we need to
> make API changes (other than additions) then we can make those in the 3.1
> release. Because the goals we set for the 3.0 release have been reached
> with the current API and if we are read
DSv2 is far from stable right? All the actual data types are unstable and you
guys have completely ignored that. We'd need to work on that and that will be a
breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that
seems too invasive of a change to backport once you consider
+1! Long due for a preview release.
On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau < hol...@pigscanfly.ca > wrote:
>
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time
Having three modes is a lot. Why not just use ansi mode as default, and legacy
for backward compatibility? Then over time there's only the ANSI mode, which is
standard compliant and easy to understand. We also don't need to invent a
standard just for Spark.
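For reference, the modes under discussion surfaced in later releases roughly as the following configuration (a sketch from memory of the eventual option names and values; verify against the docs for your Spark version):

```
# spark-defaults.conf sketch; option names as they later shipped (hedged)
spark.sql.storeAssignmentPolicy   ANSI    # ANSI | LEGACY | STRICT
spark.sql.ansi.enabled            true    # opt in to ANSI-compliant behavior
```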
On Thu, Sep 05, 2019 at 12:27 AM,
>>>
>>>> maybe in the future, but not right now as the hadoop 2.7 build is broken.
>>>>
>>>>
>>>> also, i busted dev/run-tests.py in my changes
>>>> to support java11 in PRBs:
>>>> https://github.com/apache/spark/pull/25585
>>>>
>>>>
>>>>
>>>> quick fix, testing now.
>>>>
>>>> On Mon, Aug 26, 2019 at 10:23 AM Reynold Xin < r...@databricks.com > wrote:
>>>>
>>>>
>>>>> Would it be possible to have one build that works for both?
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
Would it be possible to have one build that works for both?
On Mon, Aug 26, 2019 at 10:22 AM Dongjoon Hyun
wrote:
> Thank you all!
>
> Let me add more explanation on the current status.
>
> - If you want to run on JDK8, you need to build on JDK8
> - If you want to run on JDK11, you need
>
> Agreed that a separate discussion about overflow might be warranted. I’m
> surprised we don’t throw an error now, but it might be warranted to do so.
>
>
>
>
>
>
>
>
> -Matt Cheah
>
>
>
>
>
>
>
> *From:* Reynold Xin
Matt, what do you mean by maximizing 3 while allowing operations not to throw
errors when they overflow? Those two seem contradictory.
On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:
>
>
>
> I’m -1, simply from disagreeing with the premise that we can afford to not
I like the spirit, but not sure about the exact proposal. Take a look at
k8s':
https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md
On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon wrote:
> (Plus, it helps to track history too. Spark's commit logs are
's. Any samples to share :)
>
>
> Regards,
> Gourav
>
> On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>> There is no explicit limit but a JVM string cannot be bigger than 2G. It
>> will also
There is no explicit limit, but a JVM string cannot be bigger than 2G. A query
will also at some point run out of memory with too big a query plan tree, or
become incredibly slow due to query planning complexity. I've seen queries that
are tens of MBs in size.
On Thu, Jul 11, 2019 at 5:01 AM, 李书明
Hi all,
In the past two years, the pandas UDFs are perhaps the most important changes
to Spark for Python data science. However, these functionalities have evolved
organically, leading to some inconsistencies and confusions among users. I
created a ticket and a document summarizing the issues,
That's a good idea. We should only be using squash.
On Mon, Jul 01, 2019 at 1:52 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> Hi, Apache Spark PMC members and committers.
>
>
> We are using GitHub `Merge Button` in `spark-website` repository
> because it's very convenient.
>
>
Seems like a good idea. Can we test this with a component first?
On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun
wrote:
> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been
+1 on Xiangrui’s plan.
On Thu, May 30, 2019 at 7:55 AM shane knapp wrote:
> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
>
Thanks Tom.
I finally had time to look at the updated SPIP 10 mins ago. I support the high
level idea and +1 on the SPIP.
That said, I think the proposed API is too complicated and invasive change to
the existing internals. A much simpler API would be to expose a columnar batch
iterator
thoughts on how to proceed on something like this, as there
> will probably be a few more similar issues.
>
>
>
> On Fri, May 10, 2019 at 3:32 PM Reynold Xin < r...@databricks.com > wrote:
>
>
>>
>>
>> Yea my main point is when w
gt; At some point maybe we figure out whether we can remove the SBT-based
>> build if it's super painful, but only if there's not much other choice.
>> That is for a future thread.
>>
>>
>>
>> On Fri, May 10, 2019 at 1:51 PM Reynold Xin < rxin@ databri
Looks like a great idea to make changes in Spark 3.0 to prepare for Scala 2.13
upgrade.
Are there breaking changes that would require us to have two different source
code for 2.12 vs 2.13?
On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:
>
>
>
> While that's not
I do feel it'd be better to not switch default Scala versions in a minor
release. I don't know how much downstream this impacts. Dotnet is a good data
point. Anybody else hit this issue?
On Thu, Apr 25, 2019 at 11:36 PM, Terry Kim < yumin...@gmail.com > wrote:
>
>
>
> Very much interested
"if others think it would be helpful, we can cancel this vote, update the SPIP
to clarify exactly what I am proposing, and then restart the vote after we have
gotten more agreement on what APIs should be exposed"
That'd be very useful. At least I was confused by what the SPIP was about. No
lly wouldn't backport, except that I've heard a
> few times about concerns about CVEs affecting Databind, so wondering
> who else out there might have an opinion. I'm not pushing for it
> necessarily.
>
> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin wrote:
> >
> > For Jackso