What about a separate branch for scala 2.10?
Original message
From: Koert Kuipers
Date: 4/2/2016 02:10 (GMT+02:00)
To: Michael Armbrust
Cc: Matei Zaharia , Mark
Thanks a lot, Sean, really appreciate your comments.
From: Sean Owen
Sent: Friday, April 1, 2016 12:55 PM
To: Renyi Xiong
Cc: Tathagata Das; dev
Subject: Re: Declare rest of @Experimental items non-experimental if
they've existed since 1.2.0
The change there was
So I think a ramdisk is a simple way to try.
Besides, I think Reynold's suggestion is quite valid: with such a high-end
machine, putting everything in memory might not improve performance as
much as assumed, since the bottleneck will shift elsewhere, such as memory
bandwidth, NUMA, CPU efficiency
Yes we see it on final write. Our preference is to eliminate this.
On Fri, Apr 1, 2016, 7:25 PM Saisai Shao wrote:
> Hi Michael, shuffle data (mapper output) has to be materialized to disk
> eventually, no matter how much memory you have; it is the design purpose of
>
Hi Michael, shuffle data (mapper output) has to be materialized to disk
eventually, no matter how much memory you have; it is the design purpose of
Spark. In your scenario, since you have a lot of memory, shuffle spill should
not happen frequently; most of the disk IO you see might be the final shuffle
as long as we don't lock ourselves into supporting scala 2.10 for the
entire spark 2 lifespan it sounds reasonable to me
On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust
wrote:
> +1 to Matei's reasoning.
>
> On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia
As I mentioned earlier this flag is now ignored.
On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch wrote:
> Shuffling a 1tb set of keys and values (aka sort by key) results in about
> 500gb of io to disk if compression is enabled. Is there any way to
> eliminate shuffling
It's spark.local.dir.
On Fri, Apr 1, 2016 at 3:37 PM, Yong Zhang wrote:
> Is there a configuration in Spark for the location of "shuffle spilling"? I
> don't recall ever seeing one. Can you share it?
>
> It will be good for a test writing to RAM Disk if that
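
For anyone wanting to try the ramdisk test discussed here, below is a minimal sketch of how the setting could be passed on the command line. `spark.local.dir` and the `--conf` option of `spark-submit` are real; the helper name `local_dir_conf`, the path `/dev/shm/spark-local`, and the job file `my_job.py` are placeholders assumed for illustration, not anything taken from the thread.

```python
import shlex


def local_dir_conf(paths):
    """Build spark-submit arguments that set spark.local.dir.

    `paths` is a list of directories; Spark accepts a comma-separated
    list for this setting. The directories are assumed to exist already.
    """
    if not paths:
        raise ValueError("at least one directory is required")
    return ["--conf", "spark.local.dir=" + ",".join(paths)]


# Hypothetical ramdisk location and job file, for illustration only.
args = ["spark-submit"] + local_dir_conf(["/dev/shm/spark-local"]) + ["my_job.py"]

# Print a shell-safe command line that could be pasted into a terminal.
print(" ".join(shlex.quote(a) for a in args))
```

On a cluster, the cluster manager's own local-directory setting (for example `SPARK_LOCAL_DIRS` in standalone mode) may take precedence over this value, so it is worth checking which directories the executors actually use.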
Is there a configuration in Spark for the location of "shuffle spilling"? I
don't recall ever seeing one. Can you share it?
It would be good to test writing to a RAM disk if that configuration is
available.
Thanks
Yong
From: r...@databricks.com
Date: Fri, 1 Apr 2016 15:32:23 -0700
Shuffling a 1tb set of keys and values (aka sort by key) results in about
500gb of io to disk if compression is enabled. Is there any way to
eliminate shuffling causing io?
On Fri, Apr 1, 2016, 6:32 PM Reynold Xin wrote:
> Michael - I'm not sure if you actually read my
Michael - I'm not sure if you actually read my email, but spill has nothing
to do with the shuffle files on disk. It was for the partitioning (i.e.
sorting) process. If that flag is off, Spark will just run out of memory
when data doesn't fit in memory.
On Fri, Apr 1, 2016 at 3:28 PM, Michael
I think Reynold's suggestion of using ram disk would be a good way to
test if these are the bottlenecks or something else is.
For most practical purposes, pointing local dir to ramdisk should
effectively give you 'similar' performance as shuffling from memory.
Are there concerns with taking that
If you work for a certain hardware vendor that builds expensive, high
performance nodes, and want to use Spark to demonstrate the performance
gains of your new great systems, you will of course totally disagree.
Anyway - I offered you a simple solution to work around the low hanging
fruits. Feel
Sure - feel free to totally disagree.
On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch wrote:
> I totally disagree that it’s not a problem.
>
> - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME
> drives.
> - What Spark is depending on is Linux’s IO
I totally disagree that it’s not a problem.
- Network fetch throughput on 40G Ethernet exceeds the throughput of NVME
drives.
- What Spark is depending on is Linux’s IO cache as an effective buffer pool.
This is fine for small jobs but not for jobs with datasets in the TB/node range.
- On
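
As a rough back-of-the-envelope check on the figures quoted in this thread (40G Ethernet, about 500 GB of shuffle IO), the sketch below converts a link speed to bytes per second and estimates idealized transfer times. The disk throughput value is a hypothetical placeholder, not a measurement from the thread, and the calculation ignores protocol overhead, multiple drives per node, and the OS page cache.

```python
def line_rate_bytes_per_sec(gigabits_per_sec):
    """Convert a link speed in Gbit/s to bytes/s (decimal units)."""
    return gigabits_per_sec * 1e9 / 8


def transfer_seconds(total_bytes, bytes_per_sec):
    """Idealized time to move total_bytes at a constant throughput."""
    return total_bytes / bytes_per_sec


SHUFFLE_BYTES = 500e9                    # ~500 GB of shuffle IO, as quoted in the thread
NETWORK_BPS = line_rate_bytes_per_sec(40)  # 40G Ethernet at line rate

# ASSUMPTION: placeholder sustained write throughput for a single drive.
# Replace with a measured value for the hardware under test.
HYPOTHETICAL_DISK_BPS = 2e9

print("network:", transfer_seconds(SHUFFLE_BYTES, NETWORK_BPS), "s")
print("disk:   ", transfer_seconds(SHUFFLE_BYTES, HYPOTHETICAL_DISK_BPS), "s")
```

Whichever of the two times is larger points at the likely bottleneck under these simplified assumptions; with real measurements plugged in, the comparison may come out either way.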
spark.shuffle.spill actually has nothing to do with whether we write
shuffle files to disk. Currently it is not possible to not write shuffle
files to disk, and typically it is not a problem because the network fetch
throughput is lower than what disks can sustain. In most cases, especially
with
Blocking operators like Sort, Join or Aggregate will put all of the data
for a whole partition into a hash table or array. However, if you are
running Spark 1.5+ we should be spilling to disk. In Spark 1.6 if you are
seeing OOMs for SQL operations you should report it as a bug.
On Thu, Mar 31,
The change there was just to mark the methods non-experimental. The
logic was that they'd been around for many releases without change,
and are unlikely to be changed now that they've been in the wild so
long, so already acted as if they're part of the normal stable API.
Are they important? I
Hello;
I’m working on spark with very large memory systems (2TB+) and notice that
Spark spills to disk in shuffle. Is there a way to force spark to stay in
memory when doing shuffle operations? The goal is to keep the shuffle data
either in the heap or in off-heap memory (in 1.6.x) and
Hi Sean,
We're upgrading Mobius (C# binding for Spark) in Microsoft to align with
Spark 1.6.2 and noticed some changes in API you did in
https://github.com/apache/spark/commit/6f81eae24f83df51a99d4bb2629dd7daadc01519
mostly on APIs with Approx postfix. (still marked as experimental in
pyspark
Hey Reynold,
Created an issue (and a PR) for this change to get discussions started.
Thanks,
Nezih
On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin wrote:
> Using the right email for Nezih
>
>
> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin wrote:
>
>> I
Amazing! I'll fund $1/2 million for such an interesting initiative.
Oh, wait... I have only $4 in my pocket
Cheers :)
On 1 April 2016 at 11:40, Takeshi Yamamuro wrote:
> Oh, the annual event...
>
> On Fri, Apr 1, 2016 at 4:37 PM, Xiao Li wrote:
>
>>
Guys,
Getting a bit off topic.
Saying Security and HBase in the same sentence is a bit of a joke until HBase
rejiggers its coprocessors. Although Andrew’s fix could be enough to keep
CSOs and their minions happy.
The larger picture is that Security has to stop being a ‘second thought’.
Oh, the annual event...
On Fri, Apr 1, 2016 at 4:37 PM, Xiao Li wrote:
> April 1st... : )
>
> 2016-04-01 0:33 GMT-07:00 Michael Malak :
>
>> I see you've been burning the midnight oil.
>>
>>
>> --
>> *From:*
April 1st... : )
2016-04-01 0:33 GMT-07:00 Michael Malak :
> I see you've been burning the midnight oil.
>
>
> --
> *From:* Reynold Xin
> *To:* "dev@spark.apache.org"
> *Sent:* Friday, April
I see you've been burning the midnight oil.
From: Reynold Xin
To: "dev@spark.apache.org"
Sent: Friday, April 1, 2016 1:15 AM
Subject: [discuss] using deep learning to improve Spark
Hi all,
Hope you all enjoyed the Tesla 3 unveiling
Hi all,
Hope you all enjoyed the Tesla 3 unveiling earlier tonight.
I'd like to bring your attention to a project called DeepSpark that we have
been working on for the past three years. We realized that scaling software
development was challenging. A large fraction of software engineering has