Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
Cool. Using Ambari to monitor and scale up/down the cluster sounds promising. Thanks for the pointer! Mingyu From: Deepak Sharma <deepakmc...@gmail.com> Date: Monday, December 14, 2015 at 1:53 AM To: cs user <acldstk...@gmail.com> Cc: Mingyu Kim <m...@palantir.com>, …

Autoscaling of Spark YARN cluster

2015-12-14 Thread Mingyu Kim
review”, and I didn’t find much else from my search. This might be a general YARN question, but I wanted to check if there’s a solution popular in the Spark community. Any sharing of experience around autoscaling will be helpful! Thanks, Mingyu

Re: compatibility issue with Jersey2

2015-10-13 Thread Mingyu Kim
/SPARK-3996. Would this be reasonable? Mingyu On 10/7/15, 11:26 AM, "Marcelo Vanzin" <van...@cloudera.com> wrote: >Seems like you might be running into >https://issues.apache.org/jira/browse/SPARK-3996

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Mingyu Kim
Cool, we will start from there. Thanks Aaron and Josh! Darin, it’s likely because the DirectOutputCommitter is compiled with Hadoop 1 classes and you’re running it with Hadoop 2. org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and it became an interface in Hadoop 2. Mingyu
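For context, a minimal sketch of the kind of direct committer discussed in this thread, written against the old mapred API that the class-to-interface change above affects (the class name and no-op bodies are illustrative, not code from the thread):

    import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

    // Illustrative "direct" committer: tasks write straight to the final S3
    // path, so there is nothing to move on commit and every hook is a no-op.
    class DirectOutputCommitter extends OutputCommitter {
      override def setupJob(context: JobContext): Unit = ()
      override def setupTask(context: TaskAttemptContext): Unit = ()
      override def needsTaskCommit(context: TaskAttemptContext): Boolean = false
      override def commitTask(context: TaskAttemptContext): Unit = ()
      override def abortTask(context: TaskAttemptContext): Unit = ()
    }

Wiring it in would look something like sc.hadoopConfiguration.set("mapred.output.committer.class", classOf[DirectOutputCommitter].getName); as the message above notes, the class must be compiled against the same Hadoop major version it runs on.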

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Mingyu Kim
I didn’t get any response. It’d be really appreciated if anyone using a special OutputCommitter for S3 could comment on this! Thanks, Mingyu From: Mingyu Kim <m...@palantir.com> Date: Monday, February 16, 2015 at 1:15 AM To: user@spark.apache.org

Which OutputCommitter to use for S3?

2015-02-16 Thread Mingyu Kim
with Spark. Thanks, Mingyu

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-23 Thread mingyu
I found a workaround. I can make my auxiliary data an RDD, partition it, and cache it. Later, I can cogroup it with other RDDs, and Spark will try to keep the cached RDD partitions where they are and not shuffle them.
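A sketch of that workaround (names and sizes are illustrative; the point is that both sides share one partitioner, so the cached side is not shuffled):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("sticky-aux").setMaster("local[2]"))
    val partitioner = new HashPartitioner(4)

    // Large auxiliary data: partition it once and cache it, so its
    // partitions stay on the nodes that computed them.
    val aux = sc.parallelize(Seq((1, "meta-a"), (2, "meta-b")))
      .partitionBy(partitioner)
      .cache()
    aux.count() // materialize the cache

    // Cogroup incoming data with the same partitioner: the cached aux side
    // needs no shuffle, and tasks prefer the nodes holding its partitions.
    val incoming = sc.parallelize(Seq((1, 10), (2, 20))).partitionBy(partitioner)
    incoming.cogroup(aux).collect().foreach(println)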

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
Also, setting spark.locality.wait=100 did not work for me.
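For reference, this is the scheduler knob being tried: how long the scheduler waits for a free slot at the preferred locality level before falling back to a less-local one (the value is in milliseconds in this era of Spark):

    import org.apache.spark.SparkConf

    // Shorten the locality wait from the default 3000 ms to 100 ms.
    val conf = new SparkConf().set("spark.locality.wait", "100")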

How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
partition-specific auxiliary data for processing the stream. I noticed that the partitions move among the nodes. I cannot afford to move the large auxiliary data around. Thanks, Mingyu

Re: How does Spark speculation prevent duplicated work?

2014-07-16 Thread Mingyu Kim
That makes sense. Thanks everyone for the explanations! Mingyu From: Matei Zaharia <matei.zaha...@gmail.com> Reply-To: user@spark.apache.org Date: Tuesday, July 15, 2014 at 3:00 PM To: user@spark.apache.org Subject: Re: How does Spark speculation

How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
actions are not idempotent. For example, it may count a partition twice in the case of RDD.count, or write a partition to HDFS twice in the case of RDD.save*(). How does it prevent this kind of duplicated work? Mingyu
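For readers unfamiliar with the feature: speculation is off by default and is enabled with configuration along these lines (the multiplier value is illustrative):

    import org.apache.spark.SparkConf

    // With speculation on, straggler tasks get duplicate attempts and the
    // scheduler keeps the result of whichever attempt finishes first.
    val conf = new SparkConf()
      .set("spark.speculation", "true")
      .set("spark.speculation.multiplier", "1.5") // how much slower than the median marks a straggler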

JavaRDD.mapToPair throws NPE

2014-06-24 Thread Mingyu Kim
) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Mingyu

1.0.1 release plan

2014-06-19 Thread Mingyu Kim
Hi all, Is there any plan for a 1.0.1 release? Mingyu

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
union two RDDs, for example, rdd1 = [“a, b, c”], rdd2 = [“1, 2, 3”, “4, 5, 6”], then rdd1.union(rdd2).saveAsTextFile(…) should’ve resulted in a file with three lines “a, b, c”, “1, 2, 3”, and “4, 5, 6”, because the partitions from the two RDDs are concatenated. Mingyu On 4/29/14, 10:55 PM
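A runnable version of that expectation (assuming sc is an initialized SparkContext; single-partition RDDs keep the concatenation order easy to see):

    val rdd1 = sc.parallelize(Seq("a, b, c"), numSlices = 1)
    val rdd2 = sc.parallelize(Seq("1, 2, 3", "4, 5, 6"), numSlices = 1)

    // union concatenates rdd1's partitions followed by rdd2's, so collect
    // (or saveAsTextFile) should yield the three lines in this order.
    rdd1.union(rdd2).collect() // Array("a, b, c", "1, 2, 3", "4, 5, 6")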

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
Okay, that makes sense. It’d be great if this can be better documented at some point, because the only way to find out about the resulting RDD row order is by looking at the code. Thanks for the discussion! Mingyu On 4/29/14, 11:59 PM, Patrick Wendell pwend...@gmail.com wrote: I don't think

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Mingyu Kim
(And sort is really expensive.) On the other hand, if I can assume that, say, “filter” or “map” doesn’t shuffle the rows around, I can do the sort once and assume that the order is retained through such operations, saving a lot of time on unnecessary sorts. Mingyu From: Mark Hamstra m
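A sketch of that assumption (assuming sc is an initialized SparkContext): map and filter are narrow, per-partition operations, so a prior sort survives them.

    val sorted = sc.parallelize(Seq(3, 1, 2)).sortBy(identity)

    // map and filter run element-by-element within each partition and do
    // not shuffle, so the global sort order is preserved without re-sorting.
    val stillSorted = sorted.map(_ * 10).filter(_ > 10)
    stillSorted.collect() // Array(20, 30)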

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
() because map preserves the partition order. RDD order is also what allows me to get the top k out of an RDD by doing RDD.sort().take(). Am I misunderstanding it? Or is it just when an RDD is written to disk that the order is not well preserved? Thanks in advance! Mingyu On 1/22/14, 4:46 PM, Patrick
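The top-k idiom referred to above, sketched with sortBy (assuming sc is an initialized SparkContext):

    // take() reads partitions in order, so after a descending sort the
    // first k elements really are the k largest.
    val topK = sc.parallelize(Seq(5, 1, 9, 3)).sortBy(x => -x).take(2)
    // topK: Array(9, 5)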