Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
Ok, so that worked flawlessly after I upped the number of partitions to 400 from 40. Thanks! On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > I'll try that; as of now I have a small number of partitions, on the order of 20~40. …

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
… Otherwise, it's like shooting in the dark. On Fri, May 13, 2016 at 7:20 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you taken a look at SPARK-11293? > Consider using repartition to increase the number of partitions. > FYI > On Fri, May 13, 2016 at 12:14 PM …

Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
Hello, I'm using Spark version 1.6.0 and have trouble with memory when trying to do reduceByKey on a dataset with as many as 75 million keys, i.e., I get the following exception when I run the task. There are 20 workers in the cluster. It is running under standalone mode with 12 GB assigned …
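
For reference, a minimal sketch of the fix this thread converged on: passing a larger explicit partition count to reduceByKey (400 instead of ~40) so each shuffle partition, and therefore each task's memory footprint, stays small. The input path, tab-separated parsing, and sum function are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-partitions"))

    // Hypothetical input: tens of millions of (key, value) pairs.
    val pairs = sc.textFile("hdfs:///data/input").map { line =>
      val Array(k, v) = line.split('\t')
      (k, v.toLong)
    }

    // An explicit partition count shrinks each shuffle partition, bounding
    // per-task memory. This mirrors the fix in the thread: 400 partitions.
    val sums = pairs.reduceByKey(_ + _, 400)

    sums.saveAsTextFile("hdfs:///data/output")
    sc.stop()
  }
}
```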

How Spark handles dead machines during a job.

2016-04-08 Thread Sung Hwan Chung
Hello, say that I'm doing a simple rdd.map followed by collect, and say also that one of the executors finishes all of its tasks while other executors are still running. If the machine that hosted the finished executor gets terminated, does the master still have the results from the finished …

Re: Executor shutdown hooks?

2016-04-06 Thread Sung Hwan Chung
…have interruptOnCancel set, then you can catch the interrupt exception within the Task. On Wed, Apr 6, 2016 at 12:24 PM, Sung Hwan Chung <coded...@gmail.com> wrote: > Hi, I'm looking for ways to add shutdown hooks to executors, i.e. …
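
A minimal sketch of the approach suggested above: run the job in a job group with interruptOnCancel = true, so cancelling the group interrupts the task threads and the map function can catch the interrupt. The cleanup() helper is hypothetical; the sleep stands in for long-running work.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InterruptibleJob {
  def cleanup(): Unit = { /* hypothetical: release resources, flush logs, ... */ }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("interruptible-job"))
    sc.setJobGroup("long-job", "long-running map", interruptOnCancel = true)

    sc.parallelize(1 to 1000, 8).map { i =>
      try {
        Thread.sleep(60000) // stand-in for long-running per-row work
        i * 2
      } catch {
        case e: InterruptedException =>
          cleanup() // per-task shutdown logic runs before the task dies
          throw e   // re-throw so the task is reported as killed
      }
    }.collect()
  }
}

// Driver side: sc.cancelJobGroup("long-job") triggers the interrupts.
```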

Executor shutdown hooks?

2016-04-06 Thread Sung Hwan Chung
Hi, I'm looking for ways to add shutdown hooks to executors, i.e., for when a job is forcefully terminated before it finishes. The scenario goes like this: executors are running a long-running job within a 'map' function. The user decides to terminate the job, and the mappers should then perform some …

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
… On 28 March 2016 at 23:30, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Yea, that seems to be the case. It seems that dynam…

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
… On 28 March 2016 at 22:58, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > It seems that the conf/slaves file is only for consumption by the following scripts: …

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
It seems that the conf/slaves file is only for consumption by the following scripts: sbin/start-slaves.sh, sbin/stop-slaves.sh, sbin/start-all.sh, and sbin/stop-all.sh. I.e., the conf/slaves file doesn't affect a running cluster. Is this true? On Mon, Mar 28, 2016 at 9:31 PM, Sung Hwan Chung <co…

Re: Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
… On 28 March 2016 at 22:06, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Hello, …

Dynamically adding/removing slaves through start-slave.sh and stop-slave.sh

2016-03-28 Thread Sung Hwan Chung
Hello, I found that I could dynamically add/remove new workers to a running standalone Spark cluster by simply triggering start-slave.sh (SPARK_MASTER_ADDR) and stop-slave.sh. E.g., I could instantiate a new AWS instance and just add it to a running cluster without needing to add it to the slaves …

Parquet StringType column readable as plain-text despite being Gzipped

2016-02-03 Thread Sung Hwan Chung
Hello, we are using the default compression codec for Parquet when we store our dataframes. The dataframe has a StringType column whose values can be up to several MB in size. The funny thing is that once it's stored, we can browse the file content with a plain-text editor and see large portions …
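
One thing worth checking is whether the codec is actually being applied. A hedged sketch for Spark 1.x: set the Parquet codec explicitly rather than relying on the default, then re-inspect the written files. The paths and input format are placeholders.

```scala
import org.apache.spark.sql.SQLContext

object ParquetCodecCheck {
  def writeCompressed(sqlContext: SQLContext): Unit = {
    // Explicitly request gzip for Parquet output (Spark 1.x config key).
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    val df = sqlContext.read.json("hdfs:///data/in.json") // placeholder input
    df.write.parquet("hdfs:///data/out.parquet")
  }
}
```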

Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
I noticed that in the main branch, the ec2 directory, along with the spark-ec2 script, is no longer present. Is spark-ec2 going away in the next release? If so, what would be the best alternative at that time? A couple of additional questions: 1. Is there any way to add/remove additional workers …

Re: Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
…, Alexander Pivovarov <apivova...@gmail.com> wrote: > You can use EMR-4.3.0 running on spot instances to control the price. > Yes, you can add/remove instances to the cluster on the fly (CORE instances support add only; TASK instances, add and remove). > On Wed, Jan 27, 20…

Re: Is spark-ec2 going away?

2016-01-27 Thread Sung Hwan Chung
…control the price. >> Yes, you can add/remove instances to the cluster on the fly (CORE instances support add only; TASK instances, add and remove). >> On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: …

Re: java.io.IOException Error in task deserialization

2014-10-10 Thread Sung Hwan Chung
I haven't seen this at all since switching to HttpBroadcast. It seems TorrentBroadcast might have some issues? On Thu, Oct 9, 2014 at 4:28 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > I don't think that I saw any other error message. This is all I saw. I'm currently experimenting …
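
A sketch of the workaround described in this thread, for Spark 1.x (HttpBroadcast was removed in Spark 2.0): switch the broadcast implementation away from the default TorrentBroadcast via spark.broadcast.factory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HttpBroadcastApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("http-broadcast-workaround")
      // Broadcasts now go over HTTP from the driver instead of BitTorrent-style.
      .set("spark.broadcast.factory",
           "org.apache.spark.broadcast.HttpBroadcastFactory")
    val sc = new SparkContext(conf)
    // ... job as before
  }
}
```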

Spark job (not Spark Streaming) doesn't delete unneeded checkpoints.

2014-10-10 Thread Sung Hwan Chung
Unneeded checkpoints are not getting automatically deleted in my application. I.e., the lineage looks something like the following, and checkpoints simply accumulate in a temporary directory (every lineage point, however, does zip with a globally permanent RDD): PermanentRDD: Global zips with all the …

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sung Hwan Chung
…the ordering deterministic. On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Let's say you have some rows in a dataset (say X partitions initially): A, B, C, D, E, … You repartition to Y > X; then it seems that any of the following could …

Re: java.io.IOException Error in task deserialization

2014-10-09 Thread Sung Hwan Chung
I don't think that I saw any other error message. This is all I saw. I'm currently experimenting to see if this can be alleviated by using HttpBroadcastFactory instead of TorrentBroadcast. So far, with HttpBroadcast, I haven't seen this recur yet. I'll keep you posted. On Thu, Oct 9, …

Intermittent checkpointing failure.

2014-10-09 Thread Sung Hwan Chung
I'm getting a DFS closed-channel exception every now and then when I run checkpointing. I do checkpointing every 15 minutes or so, and this usually happens after running the job for 1~2 hours. Anyone seen this before? Job aborted due to stage failure: Task 6 in stage 70.0 failed 4 times, most recent …

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Sung Hwan Chung
…Using a var for RDDs in this way is not going to work. In this example, tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after that, you change what tx2 means, so you would end up having a circular dependency. On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung <coded…
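
A minimal sketch of the var-reassignment pattern under discussion and the usual mitigation: reassigning the var makes each new tx2 depend on the previous one, so the lineage grows every iteration; periodic checkpointing truncates the chain. The checkpoint path and the update function are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object VarLineage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("var-lineage"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder path

    val tx1 = sc.parallelize(1 to 1000).map(_.toDouble)
    var tx2 = sc.parallelize(1 to 1000).map(_.toDouble)

    for (i <- 1 to 50) {
      // Each reassignment makes the new tx2 depend on the old tx2, so the
      // lineage grows by one zip+map per iteration.
      tx2 = tx1.zip(tx2).map { case (a, b) => a * b }
      if (i % 10 == 0) {
        tx2.checkpoint() // truncate the chain so recovery needn't replay it all
        tx2.count()      // force evaluation; the checkpoint happens on first job
      }
    }
  }
}
```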

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Sung Hwan Chung
…, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > My job is not being fault-tolerant (e.g., when there's a fetch failure or something). The lineage of the RDDs is constantly updated every iteration. However, I think that when there's a failure, the lineage information is not being correctly …

coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Sung Hwan Chung
I noticed that repartition results in a non-deterministic lineage because it changes the ordering of rows. So, for instance, if you do something like: val data = read(...); val k = data.repartition(5); val h = k.repartition(5) — it seems that this results in a different ordering of rows for 'k' …
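
A minimal sketch of one way to make the post-shuffle lineage deterministic, as the reply in this thread suggests: impose an explicit sort key so a recomputed partition always holds the same rows in the same order. The record type, input parsing, and id field are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeterministicRepartition {
  // Placeholder record type; assumes each row carries a stable unique id.
  case class Row(id: Long, payload: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deterministic-shuffle"))

    val data = sc.textFile("hdfs:///data/in").map { line =>
      val Array(id, payload) = line.split('\t')
      Row(id.toLong, payload)
    }

    // Unlike repartition(5), sortBy imposes a total order keyed on a stable
    // field, so a recomputed partition contains the same rows in the same
    // order -- the lineage becomes deterministic.
    val k = data.sortBy(_.id, ascending = true, numPartitions = 5)
    k.saveAsTextFile("hdfs:///data/out")
  }
}
```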

Re: java.io.IOException Error in task deserialization

2014-10-08 Thread Sung Hwan Chung
This is also happening to me on a regular basis when the job is large, with relatively large serialized objects used in each RDD lineage. A bad thing is that this exception always stops the whole job. On Fri, Sep 26, 2014 at 11:17 AM, Brad Miller <bmill...@eecs.berkeley.edu> wrote: …

Is the RDD partition index consistent?

2014-10-06 Thread Sung Hwan Chung
Is the RDD partition index you get when you call mapPartitionsWithIndex consistent under fault-tolerance conditions? I.e.: 1. Say the index is 1 for one of the partitions when you call data.mapPartitionsWithIndex((index, rows) => ...). 2. The partition fails (maybe along with a bunch …
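
A runnable sketch of the call in question, using the released API name (mapPartitionsWithIndex, with the final 's'). It hands each task its partition's index along with an iterator over the partition's rows; the index identifies the partition itself, not the task attempt.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionIndexDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-index"))
    val data = sc.parallelize(1 to 100, 4)

    // Tag each row with the index of the partition it lives in.
    val tagged = data.mapPartitionsWithIndex { (index, rows) =>
      rows.map(row => (index, row))
    }

    tagged.collect().foreach(println)
  }
}
```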

Spark fault tolerance after an executor failure.

2014-07-30 Thread Sung Hwan Chung
I sometimes see that after fully caching the data, if one of the executors fails for some reason, that portion of the cache gets lost and does not get re-cached, even though there are plenty of resources. Is this a bug or normal behavior (v1.0.1)?

Re: Getting the number of slaves

2014-07-28 Thread Sung Hwan Chung
Do getExecutorStorageStatus and getExecutorMemoryStatus both return the number of executors plus the driver? E.g., if I submit a job with 10 executors, I get 11 for getExecutorStorageStatus.length and getExecutorMemoryStatus.size. On Thu, Jul 24, 2014 at 4:53 PM, Nicolas Mai <nicolas@gmail.com> …
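
A sketch of the counting question above: in Spark 1.x both methods report one entry per BlockManager, which includes the driver's, hence the off-by-one observed in this thread. Subtracting one gives the executor count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("executor-count"))

    // Each entry corresponds to a BlockManager; one of them is the driver's.
    val fromStorage = sc.getExecutorStorageStatus.length - 1
    val fromMemory  = sc.getExecutorMemoryStatus.size - 1
    println(s"executors (excluding driver): $fromStorage / $fromMemory")
  }
}
```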

collect on partitions gets very slow near the last few partitions.

2014-06-28 Thread Sung Hwan Chung
I'm doing something like this: rdd.groupBy(...).map(...).collect(). The workload on the final map is pretty much evenly distributed. When the collect happens, say on 60 partitions, the first 55 or so partitions finish very quickly, say within 10 seconds. However, the last 5, and particularly the very last one, …
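
A hedged sketch of one common fix for this straggler pattern, assuming the per-group map() can be expressed as a merge of two partial results: reduceByKey combines values map-side, so a heavily skewed key never has to materialize its full row list in one task. The key extraction and merge helpers are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkewFriendlyAggregation {
  // Hypothetical helpers: extract a key and build a mergeable partial result.
  def keyOf(row: String): String = row.takeWhile(_ != '\t')
  def toPartial(row: String): Long = 1L // e.g. a per-row count
  def merge(a: Long, b: Long): Long = a + b

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skew-friendly"))
    val rdd = sc.textFile("hdfs:///data/in")

    // Instead of rdd.groupBy(keyOf).map { case (k, rows) => summarize(rows) },
    // combine partial results per key; skewed keys shrink before the shuffle.
    val summarized = rdd.map(row => (keyOf(row), toPartial(row))).reduceByKey(merge)
    summarized.saveAsTextFile("hdfs:///data/out")
  }
}
```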

Re: collect on partitions get very slow near the last few partitions.

2014-06-28 Thread Sung Hwan Chung
…at 11:35 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > I'm doing something like this: rdd.groupBy(...).map(...).collect(). The workload on the final map is pretty much evenly distributed. When the collect happens, say on 60 partitions, the first 55 or so partitions finish very quickly, say within 10 …

Number of executors smaller than requested in YARN.

2014-06-25 Thread Sung Hwan Chung
Hi, when I try requesting a large number of executors, e.g. 242, it doesn't seem to actually reach that number. E.g., under the executors tab, I only see executor IDs of up to 234, despite the fact that there's plenty more memory available, as well as CPU cores, etc., in the system. In …

Does Spark restart cached workers even without failures?

2014-06-25 Thread Sung Hwan Chung
I'm doing coalesce with shuffle, then cache, and then running thousands of iterations. I noticed that sometimes Spark would, for no particular reason, perform a partial coalesce again after running for a long time, even though there was no exception or failure on the workers' part. Why is this happening?

Spark executor error

2014-06-25 Thread Sung Hwan Chung
I'm seeing the following message in the log of an executor. Anyone seen this error? After this, the executor seems to lose the cache, and besides that, the whole thing slows down drastically, i.e., it gets stuck in a reduce phase for 40+ minutes, whereas before it was finishing reduces in 2~3 …

Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Sung Hwan Chung
Hello, I noticed that the final reduce function happens on the driver node, with code that looks like the following: val outputMap = rdd.mapPartitions(doSomething).reduce((a: Map, b: Map) => a.merge(b)). Although individual outputs from the mappers are small, over time the aggregated result outputMap …

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread Sung Hwan Chung
I suppose what I want is the memory efficiency of toLocalIterator and the speed of collect. Is there any such thing? On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Hello, I noticed that the final reduce function happens on the driver node with code …
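
One hedged middle ground, assuming Spark 1.1 or later: treeReduce merges partial results in a tree of intermediate tasks, so the driver receives a handful of pre-merged maps instead of one per partition. The map-building logic below is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TreeReduceMaps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tree-reduce"))
    val rdd = sc.textFile("hdfs:///data/in")

    val outputMap = rdd
      .mapPartitions { rows =>
        // Build one partial Map per partition (placeholder: word counts).
        Iterator(rows.foldLeft(Map.empty[String, Long]) { (m, row) =>
          m.updated(row, m.getOrElse(row, 0L) + 1L)
        })
      }
      // depth controls how many merge levels run on executors before the
      // (much smaller) final merge reaches the driver.
      .treeReduce(
        (a, b) => a ++ b.map { case (k, v) => k -> (v + a.getOrElse(k, 0L)) },
        depth = 3)

    println(s"aggregated keys: ${outputMap.size}")
  }
}
```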

When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
I noticed that sometimes tasks switch from PROCESS_LOCAL (I'd assume this means fully cached) to NODE_LOCAL or even RACK_LOCAL. When this happens, things get extremely slow. Does this mean that the executor got terminated and restarted? Is there a way to prevent this from happening …
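
A hedged sketch of the usual lever for this behavior: the scheduler demotes a task from PROCESS_LOCAL to NODE_LOCAL/RACK_LOCAL when no process-local slot frees up within spark.locality.wait (default 3 seconds). Raising it makes Spark hold out longer for the executor that has the cached partition; the value below is illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StickyLocality {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sticky-locality")
      // Wait up to 30 s (milliseconds form works across Spark versions)
      // before falling back to a less local scheduling level.
      .set("spark.locality.wait", "30000")
    val sc = new SparkContext(conf)
    // ... job as before
  }
}
```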

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
…that this is the case? On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > I noticed that sometimes tasks switch from PROCESS_LOCAL (I'd assume this means fully cached) to NODE_LOCAL or even RACK_LOCAL. When this happens, things get extremely slow. Does this mean …

Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
When I run sbt/sbt assembly, I get the following exception. Is anyone else experiencing a similar problem? … [info] Resolving org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016 ... [info] Updating {file:/Users/Sung/Projects/spark_06_04_14/}assembly... [info] Resolving …

Re: Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
Never mind; it turns out that this is a problem with the Pivotal Hadoop distribution that we are trying to compile against. On Wed, Jun 4, 2014 at 4:16 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > When I run sbt/sbt assembly, I get the following exception. Is anyone else experiencing a similar …

Re: is it okay to reuse objects across RDDs?

2014-04-28 Thread Sung Hwan Chung
…to actually serialize singletons and pass them back and forth in a weird manner. On Mon, Apr 28, 2014 at 1:23 AM, Sean Owen <so...@cloudera.com> wrote: > On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: >> e.g. something like rdd.mapPartition((rows : Iterator[String…

Re: is it okay to reuse objects across RDDs?

2014-04-28 Thread Sung Hwan Chung
…:33 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Yes, this is what we've done as of now (if you read the earlier threads). And we were saying that we'd prefer it if Spark supported persistent worker memory management in a little less hacky way ;) On Mon, Apr 28, 2014 at 8:44 AM, Ian …
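
A sketch of the "hacky" pattern this thread refers to: a JVM-level singleton, lazily initialized once per executor and reused across tasks and iterations without being serialized into closures. The cache contents and the computation are placeholders, and note the caveat in the comments.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

// Initialized once per executor JVM, on the first task that touches it.
// Not thread-safe as written; concurrent tasks would need synchronization.
object WorkerState {
  lazy val cache: mutable.Map[Int, Double] = mutable.Map.empty
}

object PersistentWorkerMemory {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("worker-singleton"))
    val data = sc.parallelize(1 to 1000, 8).cache()

    for (iteration <- 1 to 10) {
      data.mapPartitions { rows =>
        val state = WorkerState.cache // same object on a given executor every iteration
        rows.map { i =>
          val v = state.getOrElseUpdate(i, math.sqrt(i)) // compute once, reuse later
          v * iteration
        }
      }.count()
    }
  }
}
```

This only pays off when tasks keep landing on the same executors; if an executor dies, its singleton state is silently rebuilt from scratch, which is exactly the fault-tolerance concern raised elsewhere in these threads.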

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Sung Hwan Chung
…at 11:15 AM, Marcelo Vanzin <van...@cloudera.com> wrote: > Hi Sung, On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: >> The goal is to keep an intermediate value per row in memory, which would allow faster subsequent computations. I.e., computeSomething would …

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
…right? A feature-extraction algorithm like matrix factorization and its variants could be used to decrease the feature space as well... On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Thanks for the info on the memory requirement. I think that a lot of businesses …

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
…YARN integration is actually complete in CDH 5.0. We support it as well as standalone mode. On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen <so...@cloudera.com> wrote: > On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: >> Debasish, unfortunately, we are bound …

Re: Random Forest on Spark

2014-04-18 Thread Sung Hwan Chung
Sorry, that was incomplete information. I think Spark's compression helped (not sure how much, though), so the actual memory requirement may have been smaller. On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > I would argue that memory in clusters is still …

Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-18 Thread Sung Hwan Chung
Are there scenarios where developers have to be aware of how Spark's fault tolerance works to implement correct programs? It seems that if we want to maintain any sort of mutable state in each worker across iterations, it can have unintended effects once a machine goes down. E.g., …

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
Debasish, we've tested the MLlib decision tree a bit, and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker. With RF, it's very easy to get to depth 100+ with even only 100,000+ rows (because …

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
…memory than 2-4 GB per worker for most big data workloads. On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Debasish, we've tested the MLlib decision tree a bit, and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy …

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
…with the assessment that forests are a variance-reduction technique, but I'd be a little surprised if a bunch of hugely deep trees didn't overfit to the training data. I guess I view limiting tree depth as analogous to regularization in linear models. On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded…

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
…to be supported at the tree level. On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Well, if you read the original paper, http://oz.berkeley.edu/~breiman/randomforest2001.pdf: "Grow the tree using CART methodology to maximum size and do not prune." Now, the elements …

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
…does the paper also propose to grow shallow trees? Thanks. Deb On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Additionally, 'random features per node' (or mtry in R) is a very important feature of Random Forest. The variance reduction comes …

Re: Random Forest on Spark

2014-04-17 Thread Sung Hwan Chung
…this feedback into account with respect to improving the tree implementation, but if anyone can send over use cases or (even better) datasets where really deep trees are necessary, that would be great! On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Well, if you …

Mutable tagging of RDD rows?

2014-03-28 Thread Sung Hwan Chung
Hey guys, I need to tag individual RDD rows with some values, and the tag values would change at every iteration. Is this possible with RDDs (I suppose this is sort of like a mutable RDD, but more than that)? If not, what would be the best way to do something like this? Basically, we need to keep mutable …
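
A hedged sketch of the usual workaround: since RDDs are immutable, carry the tag alongside each row in a pair RDD and produce a new tagged RDD every iteration, checkpointing periodically so the growing lineage (and the circular-var pitfall from the other thread above) stays under control. The row type, initial tag, update rule, and paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object MutableTags {
  case class Row(id: Long, features: Array[Double])

  // Placeholder update rule for the per-row tag.
  def updateTag(row: Row, tag: Double): Double = tag + row.features.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tagged-rows"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val data = sc.parallelize(1 to 1000, 8)
      .map(i => Row(i, Array(i.toDouble))).cache()

    var tagged = data.map(row => (row, 0.0)) // initial tag per row
    for (iteration <- 1 to 100) {
      // Each iteration yields a *new* RDD with updated tags.
      tagged = tagged.map { case (row, tag) => (row, updateTag(row, tag)) }
      if (iteration % 10 == 0) {
        tagged.persist(StorageLevel.MEMORY_AND_DISK)
        tagged.checkpoint() // truncate lineage accumulated over iterations
        tagged.count()      // force materialization
      }
    }
  }
}
```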

Re: YARN problem using an external jar in worker nodes

2014-03-27 Thread Sung Hwan Chung
…? -Sandy On Wed, Mar 26, 2014 at 1:17 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Hello (this is YARN related), I'm able to load an external jar and use its classes within the ApplicationMaster. I wish to use this jar on the worker nodes as well, so I added sc.addJar(pathToJar) and ran. I get …

Re: YARN problem using an external jar in worker nodes

2014-03-27 Thread Sung Hwan Chung
…, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > Yea, it's in standalone mode, and I did use the SparkContext.addJar method and tried setting setExecutorEnv, SPARK_CLASSPATH, etc., but none of it worked. I finally made it work by modifying the ClientBase.scala code where I set 'appMasterOnly …

YARN problem using an external jar in worker nodes

2014-03-26 Thread Sung Hwan Chung
Hello (this is YARN related), I'm able to load an external jar and use its classes within the ApplicationMaster. I wish to use this jar on the worker nodes as well, so I added sc.addJar(pathToJar) and ran. I get the following exception: org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4 …