Ok, so that worked flawlessly after I upped the number of partitions to 400
from 40.
Thanks!
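For later readers, a minimal sketch of the two usual ways to raise the
partition count, assuming a spark-shell session where rdd is a hypothetical
RDD[(String, Int)] (the names and the count of 400 are illustrative):

    // Pass the partition count straight to reduceByKey...
    val counts = rdd.reduceByKey(_ + _, 400)
    // ...or repartition explicitly before reducing.
    val counts2 = rdd.repartition(400).reduceByKey(_ + _)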
On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung <coded...@cs.stanford.edu>
wrote:
> I'll try that; as of now I have a small number of partitions, on the order
> of 20~40.
>
>
Otherwise, it's like shooting in the dark.
On Fri, May 13, 2016 at 7:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Have you taken a look at SPARK-11293 ?
>
> Consider using repartition to increase the number of partitions.
>
> FYI
>
> On Fri, May 13, 2016 at 12:14 PM
Hello,
I'm using Spark version 1.6.0 and have trouble with memory when trying to
do reduceByKey on a dataset with as many as 75 million keys; I get the
following exception when I run the task.
There are 20 workers in the cluster. It is running in standalone mode
with 12 GB assigned
Hello,
Say that I'm doing a simple rdd.map followed by collect. Say also that
one of the executors finishes all of its tasks, but there are still other
executors running.
If the machine that hosted the finished executor gets terminated, does the
master still have the results from the finished
> If you have interruptOnCancel set, then you
> can catch the interrupt exception within the Task.
>
> On Wed, Apr 6, 2016 at 12:24 PM, Sung Hwan Chung <coded...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm looking for ways to add shutdown hooks to executors : i.e
Hi,
I'm looking for ways to add shutdown hooks to executors, i.e., for when a job
is forcefully terminated before it finishes.
The scenario goes like this: executors are running a long-running job
within a 'map' function. The user decides to terminate the job, then the
mappers should perform some
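A minimal sketch of the interruptOnCancel route suggested in the replies,
assuming the work happens in a map over an RDD; longComputation and cleanup
are hypothetical names:

    // Tasks in this job group receive a thread interrupt when the group is
    // cancelled, which the task body can catch to run cleanup logic.
    sc.setJobGroup("long-job", "interruptible mappers", interruptOnCancel = true)
    val result = rdd.map { row =>
      try {
        longComputation(row)
      } catch {
        case e: InterruptedException =>
          cleanup()   // shutdown-hook-style work before the task dies
          throw e
      }
    }
    // From elsewhere: sc.cancelJobGroup("long-job") triggers the interrupt.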
> On 28 March 2016 at 23:30, Sung Hwan Chung <coded...@cs.stanford.edu>
> wrote:
>
>> Yea, that seems to be the case. It seems that dynam
> On 28 March 2016 at 22:58, Sung Hwan Chung <coded...@cs.stanford.edu>
> wrote:
>
>> It seems that the conf/slaves file is only for consumption by the
>> following scripts:
It seems that the conf/slaves file is only for consumption by the following
scripts:
sbin/start-slaves.sh
sbin/stop-slaves.sh
sbin/start-all.sh
sbin/stop-all.sh
I.e., the conf/slaves file doesn't affect a running cluster.
Is this true?
On Mon, Mar 28, 2016 at 9:31 PM, Sung Hwan Chung <co
> On 28 March 2016 at 22:06, Sung Hwan Chung <coded...@cs.stanford.edu>
> wrote:
>
>> Hello,
>>
>>
Hello,
I found that I could dynamically add and remove workers in a running
standalone Spark cluster by simply triggering:
start-slave.sh (SPARK_MASTER_ADDR)
and
stop-slave.sh
E.g., I could instantiate a new AWS instance and just add it to a running
cluster without needing to add it to slaves
Hello,
We are using the default compression codec for Parquet when we store our
dataframes. The dataframe has a StringType column whose values can be up to
several MB in size.
The funny thing is that once it's stored, we can browse the file content
with a plain text editor and see large portions
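For reference, a sketch of pinning the codec explicitly rather than relying
on the default, assuming a Spark 1.6-era sqlContext and a DataFrame df; the
output path is illustrative:

    // Force gzip (alternatives include snappy and uncompressed).
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    df.write.parquet("hdfs:///tmp/out")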
I noticed that in the main branch, the ec2 directory along with the
spark-ec2 script is no longer present.
Is spark-ec2 going away in the next release? If so, what would be the best
alternative at that time?
A couple of additional questions:
1. Is there any way to add/remove additional workers
, Alexander Pivovarov <apivova...@gmail.com>
wrote:
> you can use EMR-4.3.0 running on spot instances to control the price
>
> yes, you can add/remove instances to the cluster on the fly (CORE instances
> support add only; TASK instances support both add and remove)
>
>
>
> On Wed, Jan 27, 20
I haven't seen this at all since switching to HttpBroadcast. It seems
TorrentBroadcast might have some issues?
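For anyone wanting to reproduce the switch, a sketch of selecting the HTTP
implementation via configuration (Spark 1.x only; HttpBroadcast was later
removed; the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("http-broadcast-test")
      .set("spark.broadcast.factory",
           "org.apache.spark.broadcast.HttpBroadcastFactory")
    val sc = new SparkContext(conf)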
On Thu, Oct 9, 2014 at 4:28 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
I don't think that I saw any other error message. This is all I saw.
I'm currently experimenting
Unneeded checkpoints are not getting automatically deleted in my
application.
I.e., the lineage looks something like this, and checkpoints simply
accumulate in a temporary directory (every lineage point, however, does zip
with a globally permanent RDD):
PermanentRDD:Global zips with all the
the ordering
deterministic.
On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Let's say you have some rows in a dataset (say X partitions initially).
A
B
C
D
E
.
.
.
.
You repartition to Y > X, then it seems that any of the following could
I don't think that I saw any other error message. This is all I saw.
I'm currently experimenting to see if this can be alleviated by using
HttpBroadcastFactory instead of TorrentBroadcast. So far, with
HttpBroadcast, I haven't seen it recur yet. I'll keep you
posted.
On Thu, Oct 9,
I'm getting a DFS closed channel exception every now and then when I run
checkpointing. I checkpoint every 15 minutes or so. This usually happens
after running the job for 1~2 hours. Has anyone seen this before?
Job aborted due to stage failure: Task 6 in stage 70.0 failed 4 times,
most recent
Using a var for RDDs in this way is not going to work. In this example,
tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after
that, you change what tx2 means, so you would end up having a circular
dependency.
On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung coded
, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
My job is not being fault-tolerant (e.g., when there's a fetch failure or
something).
The lineage of the RDDs is constantly updated every iteration. However, I
think that when there's a failure, the lineage information is not being
correctly
I noticed that repartition will result in non-deterministic lineage because
it'll result in changed orders for rows.
So for instance, if you do things like:
val data = read(...)
val k = data.repartition(5)
val h = k.repartition(5)
It seems that this results in different ordering of rows for 'k'
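One workaround (an assumption, not something from this thread) is to impose
an explicit sort on a stable key after repartitioning, which makes the
ordering deterministic at the cost of a shuffle; keyOf is a hypothetical
stable key function:

    val k = data.repartition(5).sortBy(keyOf)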
This is also happening to me on a regular basis, when the job is large with
relatively large serialized objects used in each RDD lineage. A bad thing
about this is that this exception always stops the whole job.
On Fri, Sep 26, 2014 at 11:17 AM, Brad Miller bmill...@eecs.berkeley.edu
wrote:
Is the RDD partition index you get when you call mapPartitionsWithIndex
consistent under fault-tolerance conditions?
I.e.
1. Say index is 1 for one of the partitions when you call
data.mapPartitionsWithIndex((index, rows) => ...) // Say index is 1
2. The partition fails (maybe along with a bunch
I sometimes see that after fully caching the data, if one of the executors
fails for some reason, that portion of cache gets lost and does not get
re-cached, even though there are plenty of resources. Is this a bug or
normal behavior (v1.0.1)?
Do getExecutorStorageStatus and getExecutorMemoryStatus both return the
number of executors + the driver?
E.g., if I submit a job with 10 executors, I get 11 for
getExecutorStorageStatus.length and getExecutorMemoryStatus.size
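A sketch of the two calls side by side (Spark 1.x developer API); the extra
entry would be the driver's own BlockManager, which both methods report
along with the executors, matching the 11-for-10 observation:

    val storage = sc.getExecutorStorageStatus   // Array[StorageStatus]
    val memory  = sc.getExecutorMemoryStatus    // Map[host:port -> (maxMem, remainingMem)]
    println(s"${storage.length} storage entries, ${memory.size} memory entries")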
On Thu, Jul 24, 2014 at 4:53 PM, Nicolas Mai nicolas@gmail.com
I'm doing something like this:
rdd.groupBy.map().collect()
The workload on the final map is pretty much evenly distributed.
When collect happens, say on 60 partitions, the first 55 or so partitions
finish very quickly, say within 10 seconds. However, the last 5,
particularly the very last one,
at 11:35 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
I'm doing something like this:
rdd.groupBy.map().collect()
The workload on the final map is pretty much evenly distributed.
When collect happens, say on 60 partitions, the first 55 or so partitions
finish very quickly, say within 10
Hi,
When I try requesting a large number of executors - e.g., 242 - it doesn't
seem to actually reach that number. E.g., under the executors tab, I only
see executor IDs of up to 234.
This is despite the fact that there's plenty more memory available, as well
as CPU cores, etc., in the system. In
I'm doing coalesce with shuffle, then cache, and then thousands of iterations.
I noticed that sometimes Spark would, for no particular reason, perform a
partial coalesce again after running for a long time - and there was no
exception or failure on the workers' part.
Why is this happening?
I'm seeing the following message in the log of an executor. Anyone
seen this error? After this, the executor seems to lose the cache, and
besides that the whole thing slows down drastically - i.e., it gets
stuck in a reduce phase for 40+ minutes, whereas before it was
finishing reduces in 2~3
Hello,
I noticed that the final reduce function happens in the driver node, with
code that looks like the following:
val outputMap = rdd.mapPartitions(doSomething).reduce { (a: Map, b: Map) =>
  a.merge(b)
}
although individual outputs from mappers are small. Over time the
aggregated result outputMap
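One way to keep the merging off the driver (an assumption, not from this
thread) is treeReduce, which appeared in the core RDD API in later 1.x
releases and combines partial results on the executors first. A sketch,
using plain Map concatenation (++) as a stand-in for merge and doSomething
as in the message above:

    val outputMap = rdd
      .mapPartitions(doSomething)
      .treeReduce((a, b) => a ++ b, depth = 2)   // merged executor-side in rounds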
I suppose what I want is the memory efficiency of toLocalIterator and the
speed of collect. Is there any such thing?
On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
Hello,
I noticed that the final reduce function happens in the driver node with a
code
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume
that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
When this happens, things get extremely slow.
Does this mean that the executor got terminated and restarted?
Is there a way to prevent this from happening
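One knob worth knowing here (an assumption, not an answer from this thread):
the scheduler waits spark.locality.wait for a PROCESS_LOCAL slot before
falling back to NODE_LOCAL and then RACK_LOCAL, so raising it trades
scheduling delay for locality:

    import org.apache.spark.SparkConf

    // Milliseconds in early 1.x releases; the default is 3000.
    val conf = new SparkConf().set("spark.locality.wait", "10000")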
that this is the case?
On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume
that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
When this happens, things get extremely slow.
Does this mean
When I run sbt/sbt assembly, I get the following exception. Is anyone else
experiencing a similar problem?
..
[info] Resolving org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016
...
[info] Updating {file:/Users/Sung/Projects/spark_06_04_14/}assembly...
[info] Resolving
Never mind, it turns out that this is a problem with the Pivotal Hadoop
distribution that we are trying to compile against.
On Wed, Jun 4, 2014 at 4:16 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
When I run sbt/sbt assembly, I get the following exception. Is anyone else
experiencing a similar
to actually serialize singletons and pass it
back and forth in a weird manner.
On Mon, Apr 28, 2014 at 1:23 AM, Sean Owen so...@cloudera.com wrote:
On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
e.g. something like
rdd.mapPartition((rows : Iterator[String
:33 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Yes, this is what we've done as of now (if you read earlier threads).
And we were saying that we'd prefer it if Spark supported persistent worker
memory management in a little less hacky way ;)
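For concreteness, a sketch of that hacky pattern: a JVM-level singleton on
each executor survives across tasks, which gives persistent per-worker
memory but, as discussed, interacts badly with fault tolerance. All names
are illustrative, and rdd is assumed to be an RDD[(Long, Double)]:

    object WorkerCache {
      // One instance per executor JVM, shared by every task it runs.
      val state = new java.util.concurrent.ConcurrentHashMap[Long, Double]()
    }

    val deltas = rdd.mapPartitions { rows =>
      rows.map { case (id, value) =>
        val prev = WorkerCache.state.getOrDefault(id, 0.0)
        WorkerCache.state.put(id, value)
        (id, value - prev)   // uses the previous iteration's value
      }
    }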
On Mon, Apr 28, 2014 at 8:44 AM, Ian
at 11:15 AM, Marcelo Vanzin van...@cloudera.comwrote:
Hi Sung,
On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
The goal is to keep an intermediate value per row in memory, which would
allow faster subsequent computations. I.e., computeSomething would
right?
A feature extraction algorithm like matrix factorization and its variants
could be used to decrease the feature space as well...
On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Thanks for the info on mem requirement.
I think that a lot of businesses
YARN integration is actually complete in CDH5.0. We support it as well as
standalone mode.
On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen so...@cloudera.com wrote:
On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Debasish,
Unfortunately, we are bound
Sorry, that was incomplete information. I think Spark's compression helped
(not sure how much, though), so the actual memory requirement may have been
smaller.
On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
I would argue that memory in clusters is still
Are there scenarios where the developers have to be aware of how Spark's
fault tolerance works to implement correct programs?
It seems that if we want to maintain any sort of mutable state in each
worker through iterations, it can have some unintended effect once a
machine goes down.
E.g.,
Debasish, we've tested the MLlib decision tree a bit and it eats up too
much memory for RF purposes.
Once the tree got to depth 8~9, it was easy to get a heap exception, even
with 2~4 GB of memory per worker.
With RF, it's very easy to get 100+ depth with even only 100,000+
rows (because
memory than 2-4GB per worker for most big data workloads.
On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Debasish, we've tested the MLlib decision tree a bit and it eats up too
much memory for RF purposes.
Once the tree got to depth 8~9, it was easy
with the assessment that forests are a variance
reduction technique, but I'd be a little surprised if a bunch of hugely
deep trees don't overfit to training data. I guess I view limiting tree
depth as an analogue to regularization in linear models.
On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung
coded
to be supported at the tree level.
On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."
Now, the elements
...does the paper also propose to grow a shallow tree?
Thanks.
Deb
On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
Additionally, the 'random features per node' (or mtry in R) is a very
important feature for Random Forest. The variance reduction comes
this feedback into account with respect to
improving the tree implementation, but if anyone can send over use cases or
(even better) datasets where really deep trees are necessary, that would be
great!
On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
Well, if you
Hey guys,
I need to tag individual RDD lines with some values. This tag value would
change at every iteration. Is this possible with RDDs (I suppose this is
sort of like a mutable RDD, but it's more)?
If not, what would be the best way to do something like this? Basically, we
need to keep mutable
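A sketch of the usual functional alternative (an assumption, not from this
thread): carry the tag next to each row and rebuild the pair RDD every
iteration, checkpointing now and then so the lineage stays bounded. Row,
Tag, rows, initialTag, and updateTag are all hypothetical:

    import org.apache.spark.rdd.RDD

    var tagged: RDD[(Row, Tag)] = rows.map(r => (r, initialTag(r)))
    for (i <- 1 to numIterations) {
      tagged = tagged.map { case (r, t) => (r, updateTag(r, t)) }
      if (i % 10 == 0) { tagged.checkpoint(); tagged.count() }   // materialize it
    }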
-Sandy
On Wed, Mar 26, 2014 at 1:17 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
Hello, (this is YARN related)
I'm able to load an external jar and use its classes within the
ApplicationMaster. I wish to use this jar within worker nodes, so I added
sc.addJar(pathToJar) and ran.
I get
, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Yeah, it's in standalone mode, and I did use the SparkContext.addJar method
and tried setting SPARK_CLASSPATH via setExecutorEnv, etc., but none of it
worked.
I finally made it work by modifying the ClientBase.scala code where I set
'appMasterOnly
Hello, (this is YARN related)
I'm able to load an external jar and use its classes within the
ApplicationMaster. I wish to use this jar within worker nodes, so I added
sc.addJar(pathToJar) and ran.
I get the following exception:
org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4