RE: Understanding shuffle file name conflicts

2015-03-25 Thread Shao, Saisai
Hi Cheng, I think your scenario is handled correctly by Spark's shuffle mechanism and will not cause shuffle file name conflicts. From my understanding, the code snippet you mentioned is the same RDD graph just run twice; these two jobs will generate three stages, a map stage and a collect stage …
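
The snippet under discussion is not shown in this preview; the following is a hypothetical reconstruction of the scenario as described, assuming a single shuffle-dependent RDD collected twice:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleReuse {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("shuffle-reuse").setMaster("local[2]"))
        val grouped = sc.parallelize(1 to 100).map(i => (i % 10, i)).reduceByKey(_ + _)
        grouped.collect() // job 1: map stage + result (collect) stage
        grouped.collect() // job 2: map output is reused, only the result stage runs
        sc.stop()
      }
    }

Because both jobs come from the same RDD graph, the shuffle files written by the map stage belong to the same shuffle ID and are reused by the second job rather than rewritten.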

RE: Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Shao, Saisai
Cool, great job ☺. Thanks, Jerry. From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com] Sent: Thursday, February 26, 2015 6:11 PM To: user; dev@spark.apache.org Subject: Monitoring Spark with Graphite and Grafana. If anyone is curious to try exporting Spark metrics to Graphite, I just published …
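
For reference, Spark's built-in GraphiteSink is enabled through conf/metrics.properties; a minimal sketch (host, port, and prefix are placeholder values):

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds
    *.sink.graphite.prefix=spark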

RE: StreamingContext textFileStream question

2015-02-23 Thread Shao, Saisai
Hi Mark, for input streams like the text file stream, only RDDs can be recovered from the checkpoint, not missing files; if a file is missing, an exception will actually be raised. If you use HDFS, HDFS guarantees no data loss since it keeps 3 copies. Otherwise, user logic has to guarantee that no file is deleted …
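
A minimal sketch of the recovery pattern being described, with placeholder HDFS paths: the checkpoint restores the DStream lineage, while the input files themselves must still be durable.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FileStreamRecovery {
      def main(args: Array[String]): Unit = {
        val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder
        def create(): StreamingContext = {
          val ssc = new StreamingContext(
            new SparkConf().setAppName("file-stream"), Seconds(10))
          ssc.checkpoint(checkpointDir)
          // Checkpointing recovers the RDD lineage only; the source files must
          // still exist (e.g. on HDFS), otherwise recovery raises an exception.
          ssc.textFileStream("hdfs:///data/incoming").count().print()
          ssc
        }
        val ssc = StreamingContext.getOrCreate(checkpointDir, create _)
        ssc.start()
        ssc.awaitTermination()
      }
    }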

RE: Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
-----Original Message----- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Monday, February 2, 2015 4:49 PM To: Shao, Saisai Cc: dev@spark.apache.org; u...@spark.apache.org Subject: Re: Questions about Spark standalone resource scheduler. Hey Jerry, I think standalone mode will still add more features …

Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard that some users need multi-tenant support in standalone mode, such as multi-user management, resource management and isolation, and user whitelists. It seems the current Spark …
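
For context, the per-application knobs standalone mode already exposes are coarse; a sketch of what exists today in spark-defaults.conf (values are placeholders), as opposed to the finer-grained multi-tenant isolation being asked about:

    spark.cores.max        8    # cap on total cores one application may claim
    spark.executor.memory  4g   # memory granted to each executor
    # On the master, spark.deploy.defaultCores caps applications that set no limit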

RE: Which committers care about Kafka?

2014-12-29 Thread Shao, Saisai
… to failure. Thanks, Jerry. From: Cody Koeninger [mailto:c...@koeninger.org] Sent: Tuesday, December 30, 2014 6:50 AM To: Tathagata Das Cc: Hari Shreedharan; Shao, Saisai; Sean McNamara; Patrick Wendell; Luis Ángel Vicente Sánchez; Dibyendu Bhattacharya; dev@spark.apache.org; Koert Kuipers Subject: …

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Thanks, Patrick, for your detailed explanation. BR, Jerry. -----Original Message----- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:43 PM To: Cheng, Hao Cc: Shao, Saisai; u...@spark.apache.org; dev@spark.apache.org Subject: Re: Question on saveAsTextFile with overwrite option …

Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Hi, we have a requirement to save RDD output to HDFS with a saveAsTextFile-like API, but we need to overwrite the data if it already exists. I'm not sure whether current Spark supports this kind of operation, or whether I need to check for it manually. There's a thread on the mailing list that discussed this (http://apach …
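
One common workaround (a sketch of a hypothetical helper, not a built-in overwrite flag) is to delete the output path through the Hadoop FileSystem API before writing:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: remove any previous output, then save.
    def saveOverwriting(sc: SparkContext, rdd: RDD[String], out: String): Unit = {
      val path = new Path(out)
      val fs = path.getFileSystem(sc.hadoopConfiguration)
      if (fs.exists(path)) fs.delete(path, true) // recursive delete of old output
      rdd.saveAsTextFile(out)
    }

Setting spark.hadoop.validateOutputSpecs=false only skips the exists check and does not remove stale part files, so the explicit delete is usually the safer route.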

RE: Which committers care about Kafka?

2014-12-18 Thread Shao, Saisai
Hi all, I agree with Hari that strong exactly-once semantics are very hard to guarantee, especially in failure situations. From my understanding, even the current implementation of ReliableKafkaReceiver cannot fully guarantee exactly-once semantics after a failure: the first issue is the ordering of data replay …

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
Hi, spark.local.dir is the directory used to write map output data and persisted RDD blocks, but the file paths are hashed, so you cannot directly locate the persisted RDD block files; they will definitely be somewhere under these folders on your worker node. Thanks, Jerry. From: Priya Ch [mailto:learnin …
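
A sketch of how these scratch directories are usually configured (paths are placeholders); shuffle map output and persisted blocks land under hashed subdirectories of these roots on each worker:

    # spark-defaults.conf (or the SPARK_LOCAL_DIRS environment variable on workers);
    # comma-separate multiple paths to spread I/O across disks
    spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark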

Spark SQL unit test failed when sort-based shuffle is enabled

2014-08-11 Thread Shao, Saisai
Hi folks, I hit several Spark SQL unit test failures when sort-based shuffle is enabled. It seems Spark SQL uses GenericMutableRow, which makes all the entries in ExternalSorter's internal buffer refer to the same object; I guess GenericMutableRow uses only one mutable object to represent different rows …
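
A self-contained illustration of the reuse pitfall being described (plain Scala, not the actual Spark SQL internals): when an iterator hands out one mutable object per element, buffering references instead of copies leaves every buffered entry pointing at the last value produced.

    import scala.collection.mutable.ArrayBuffer

    class MutableRow { var value: Int = 0 }

    object ObjectReuseDemo extends App {
      val shared = new MutableRow
      // The iterator mutates and returns the same object for every element.
      val rows = Iterator(1, 2, 3).map { i => shared.value = i; shared }

      val buffered = ArrayBuffer[MutableRow]()
      rows.foreach(buffered += _)       // buffers three references to one object
      println(buffered.map(_.value))    // ArrayBuffer(3, 3, 3), not (1, 2, 3)
      // A sorter-style buffer must copy each row before buffering to avoid this.
    }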

RE: Low Level Kafka Consumer for Spark

2014-08-05 Thread Shao, Saisai
Hi, I think this is an awesome feature for the Spark Streaming Kafka interface: offering users control over partition offsets means they can build more kinds of applications on top of it. What concerns me is that if we want to do offset management, fault-tolerance-related control, and the rest, we have to …