RE: Spark on YARN

2015-07-30 Thread Shao, Saisai
You'd better also check the NodeManager log; sometimes the failure happens because your memory usage exceeds the limit configured for the YARN container. I've met a similar problem before; here is the warning log from the NodeManager: 2015-07-07 17:06:07,141 WARN

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
If you're using the WAL with Kafka, Spark Streaming will ignore this configuration (autocommit.enable) and explicitly call commitOffset to update the offset in Kafka AFTER the WAL write is done. No matter what you set autocommit.enable to, internally Spark Streaming will set it to false to turn off

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
Please see the inline comments. From: Shushant Arora [mailto:shushantaror...@gmail.com] Sent: Monday, July 6, 2015 8:51 PM To: Shao, Saisai Cc: user Subject: Re: kafka offset commit in spark streaming 1.2 So if WAL is disabled, how can a developer commit offsets explicitly in a spark streaming app

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
commit mechanism is actually timer-based, so it runs asynchronously with replication. From: Shushant Arora [mailto:shushantaror...@gmail.com] Sent: Monday, July 6, 2015 8:30 PM To: Shao, Saisai Cc: user Subject: Re: kafka offset commit in spark streaming 1.2 And what if I disable WAL and use

RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Shao, Saisai
: Tuesday, June 9, 2015 5:28 PM To: Shao, Saisai; user Subject: RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl Jerry, I agree with you. However, in my case, I kept monitoring the blockmanager folder. I do see sometimes the number of files decreased, but the folder's

RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Shao, Saisai
From the stack trace I think this problem may be due to the deletion of the broadcast variable: since you set spark.cleaner.ttl, the old broadcast variable is deleted after that timeout, and you will meet this exception when you try to use it again after that time limit. Basically I think

RE: Possible long lineage issue when using DStream to update a normal RDD

2015-05-08 Thread Shao, Saisai
I think you could use checkpoint to cut the lineage of `MyRDD`; I have a similar scenario and I use checkpoint to work around this problem :) Thanks Jerry -Original Message- From: yaochunnan [mailto:yaochun...@gmail.com] Sent: Friday, May 8, 2015 1:57 PM To: user@spark.apache.org
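A minimal sketch of this suggestion, assuming a SparkContext `sc` and a DStream `dstream`; the `state` pair RDD stands in for `MyRDD`, and all names and the checkpoint path are placeholders:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // A checkpoint directory (ideally on HDFS) is required before checkpointing.
    sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")

    var state: RDD[(String, Long)] = sc.emptyRDD[(String, Long)]

    dstream.foreachRDD { batch: RDD[(String, Long)] =>
      state = state.union(batch).reduceByKey(_ + _)
      state.cache()
      state.checkpoint()   // truncates the ever-growing lineage
      state.count()        // materialize so the checkpoint is actually written
    }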

RE: Possible long lineage issue when using DStream to update a normal RDD

2015-05-08 Thread Shao, Saisai
...@gmail.com] Sent: Friday, May 8, 2015 2:51 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: Possible long lineage issue when using DStream to update a normal RDD Thank you for this suggestion! But may I ask what's the advantage of using checkpoint instead of cache here? Because they both cut lineage

RE: shuffle.FetchFailedException in spark on YARN job

2015-04-20 Thread Shao, Saisai
I don't think this problem is related to Netty or NIO; switching to nio will not change this part of the code path that reads the index file for the sort-based shuffle reader. I think you could check your system from a few aspects: 1. Is there any hardware problem like disk full or others which makes this

RE: Spark Directed Acyclic Graph / Jobs

2015-04-17 Thread Shao, Saisai
I think this paper will be a good resource (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf), and the Dryad paper is also a good one. Thanks Jerry From: James King [mailto:jakwebin...@gmail.com] Sent: Friday, April 17, 2015 3:26 PM To: user Subject: Spark Directed Acyclic

RE: Spark + Kafka

2015-04-01 Thread Shao, Saisai
OK, seems there's nothing strange in your code. So maybe we need to narrow down the cause: would you please run the KafkaWordCount example in Spark to see if it is OK? If it is, then we should focus on your implementation; otherwise Kafka potentially has some problems. Thanks Jerry From:

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
it with spark.shuffle.spill=false Thanks Best Regards On Wed, Mar 18, 2015 at 3:39 PM, Darren Hoo darren@gmail.commailto:darren@gmail.com wrote: Thanks, Shao On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai saisai.s...@intel.commailto:saisai.s...@intel.com wrote: Yeah, as I said your job

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
to add more resources to your cluster. Thanks Jerry From: Darren Hoo [mailto:darren@gmail.com] Sent: Wednesday, March 18, 2015 3:24 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: [spark-streaming] can shuffle write to disk be disabled? Hi, Saisai Here is the duration of one

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
Please see the inline comments. Thanks Jerry From: Darren Hoo [mailto:darren@gmail.com] Sent: Wednesday, March 18, 2015 9:30 PM To: Shao, Saisai Cc: user@spark.apache.org; Akhil Das Subject: Re: [spark-streaming] can shuffle write to disk be disabled? On Wed, Mar 18, 2015 at 8:31 PM, Shao

RE: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Shao, Saisai
Would you please check your driver log or the streaming web UI to see each job's latency, including processing latency and total latency. From your code it seems the slide interval is just 3 seconds, so every 3 seconds you will process 60 seconds' worth of data; if the processing latency is larger than the sliding

RE: MappedStream vs Transform API

2015-03-16 Thread Shao, Saisai
I think both ways are OK for writing a streaming job. `transform` is a more general way to transform one DStream into another when there's no corresponding DStream API (but there is a corresponding RDD API), while using map may be more straightforward and easier to understand. Thanks Jerry

RE: Building spark over specified tachyon

2015-03-15 Thread Shao, Saisai
I think you could change the pom file under the Spark project to update the Tachyon-related dependency version and rebuild it (assuming the API is compatible and the behavior is the same). I'm not sure whether there is any command you can use to compile against a specific Tachyon version. Thanks Jerry From:

RE: Spark Streaming input data source list

2015-03-09 Thread Shao, Saisai
Hi Lin, AFAIK, currently there's no built-in receiver API for RDBMSs, but you can write your own receiver to get data from an RDBMS; for the details you can refer to the docs. Thanks Jerry From: Cui Lin [mailto:cui@hds.com] Sent: Tuesday, March 10, 2015 8:36 AM To: Tathagata Das Cc:

RE: distribution of receivers in spark streaming

2015-03-04 Thread Shao, Saisai
Hi Du, You could try to sleep for several seconds after creating the streaming context to let all the executors register, so the receivers can be distributed across the nodes more evenly. Setting locality is another way, as you mentioned. Thanks Jerry From: Du Li
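A rough sketch of the workaround; the app name, host, port, number of receivers, and sleep length are all placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("receiver-distribution")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Give the executors a few seconds to register with the driver before the
    // receivers are scheduled, so they spread over the nodes more evenly.
    Thread.sleep(5000)

    val streams = (1 to 4).map(_ =>
      ssc.socketTextStream("stream-host", 9999, StorageLevel.MEMORY_AND_DISK_SER))
    ssc.union(streams).print()
    ssc.start()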

RE: distribution of receivers in spark streaming

2015-03-04 Thread Shao, Saisai
and assigned its hostname to each receiver. Thanks Jerry From: Du Li [mailto:l...@yahoo-inc.com] Sent: Thursday, March 5, 2015 2:29 PM To: Shao, Saisai; User Subject: Re: distribution of receivers in spark streaming Hi Jerry, Thanks for your response. Is there a way to get the list of currently

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
Yes, if one key has too many values, there is still a chance of hitting the OOM. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, March 5, 2015 3:49 PM To: Shao, Saisai Cc: Cheng, Hao; user Subject: Re: Having lots of FetchFailedException in join I see. I'm using

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
Hi Jianshi, From my understanding, it may not be a problem of NIO or Netty. Looking at your stack trace, the OOM occurred in EAOM (ExternalAppendOnlyMap); theoretically EAOM can spill the data to disk if memory is not enough, but there might be some issues when the join key is skewed or key

RE: Having lots of FetchFailedException in join

2015-03-04 Thread Shao, Saisai
to read the whole partition into memory. But if you use SparkSQL, it depends on how SparkSQL uses these operators. CC @hao if he has some thoughts on it. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, March 5, 2015 3:28 PM To: Shao, Saisai Cc: user Subject: Re

RE: Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Shao, Saisai
Cool, great job☺. Thanks Jerry From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com] Sent: Thursday, February 26, 2015 6:11 PM To: user; d...@spark.apache.org Subject: Monitoring Spark with Graphite and Grafana If anyone is curious to try exporting Spark metrics to Graphite, I just

RE: spark streaming window operations on a large window size

2015-02-23 Thread Shao, Saisai
I don't think current Spark Streaming supports window operations that go beyond its available memory. Internally Spark Streaming keeps all the data belonging to the effective window in memory; if the memory is not enough, the BlockManager will discard blocks according to an LRU policy, so something

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
If you call reduceByKey(), internally Spark will introduce a shuffle operation no matter whether the data is already partitioned locally; Spark itself does not know the data is already well partitioned. So if you want to avoid the shuffle, you have to write the code explicitly to avoid this, from my

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I think some RDD APIs like zipPartitions or others can do what you want. You might check the docs. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 1:35 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: RE: Union and reduceByKey will trigger

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I've no context about this book. AFAIK, union will not trigger a shuffle, as it just puts the partitions together; the reduceByKey() operator will actually trigger a shuffle. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 12:26 PM To: Shao, Saisai Cc
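A small illustration of the point, assuming a SparkContext `sc` (the data values are made up):

    val a = sc.parallelize(Seq(("x", 1), ("y", 2)))
    val b = sc.parallelize(Seq(("x", 3), ("z", 4)))

    // union only concatenates the partitions of a and b -- no shuffle.
    val unioned = a.union(b)

    // reduceByKey must bring all values of a key together, so it introduces a
    // shuffle, even if the inputs happen to be partitioned the same way already.
    val counts = unioned.reduceByKey(_ + _)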

RE: Spark Metrics Servlet for driver and executor

2015-02-06 Thread Shao, Saisai
Hi Judy, For the driver it is /metrics/json; there's no MetricsServlet for executors. Thanks Jerry From: Judy Nash [mailto:judyn...@exchange.microsoft.com] Sent: Friday, February 6, 2015 3:47 PM To: user@spark.apache.org Subject: Spark Metrics Servlet for driver and executor Hi all, Looking at

RE: Error KafkaStream

2015-02-05 Thread Shao, Saisai
Did you include the Kafka jars? This StringDecoder is under kafka/serializer; you can refer to the unit test KafkaStreamSuite in Spark to see how to use this API. Thanks Jerry From: Eduardo Costa Alfaia [mailto:e.costaalf...@unibs.it] Sent: Friday, February 6, 2015 9:44 AM To: Shao, Saisai Cc: Sean

RE: Error KafkaStream

2015-02-05 Thread Shao, Saisai
Hi, I think you should change the `DefaultDecoder` in your type parameters to `StringDecoder`, since it seems you want to decode the message into a String. `DefaultDecoder` returns Array[Byte], not String, so the class cast will fail here. Thanks Jerry -Original Message- From:
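A hedged sketch of the suggested change, based on the Spark 1.x KafkaUtils.createStream signature; `ssc`, `kafkaParams` and `topicMap` are assumed to exist already:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Decode key and value as String instead of the Array[Byte] produced by
    // DefaultDecoder, so no class cast is needed downstream.
    val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER_2).map(_._2)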

Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard that some users require multi-tenant support in standalone mode, such as multi-user management, resource management and isolation, and a whitelist of users. Seems current

RE: Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Monday, February 2, 2015 4:49 PM To: Shao, Saisai Cc: d...@spark.apache.org; user@spark.apache.org Subject: Re: Questions about Spark standalone resource scheduler Hey Jerry, I think standalone mode will still add more features

RE: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Shao, Saisai
That's definitely a good supplement to current Spark Streaming; I've heard many people want to process data using log time. Looking forward to the code. Thanks Jerry -Original Message- From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Thursday, January 29, 2015 10:33

RE: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Shao, Saisai
Aha, you're right, I made a wrong comparison; the reason might be only for checkpointing :). Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Wednesday, January 28, 2015 10:39 AM To: Shao, Saisai Cc: user Subject: Re: Why must the dstream.foreachRDD(...) parameter

RE: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Shao, Saisai
Hey Tobias, I think one consideration is the checkpointing of the DStream, which guarantees driver fault tolerance. Also, this `foreachFunc` is more like an action function of an RDD; think of rdd.foreach(func), in which `func` needs to be serializable. So maybe I think your way of using it is not a

RE: Shuffle to HDFS

2015-01-25 Thread Shao, Saisai
Hi Larry, I don't think the current Spark shuffle can support HDFS as a shuffle output. Anyway, is there any specific reason to spill shuffle data to HDFS or NFS? This will severely increase the shuffle time. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Sunday, January 25,

RE: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Shao, Saisai
No, the current RDD persistence mechanism does not support putting data on HDFS; the directory is controlled by spark.local.dir. Instead you can use checkpoint() to save the RDD on HDFS. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Monday, January 26, 2015 3:08 PM To: Charles Feduke Cc:
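A brief sketch of the difference, assuming a SparkContext `sc` and an existing `rdd` (the HDFS path is an example only):

    import org.apache.spark.storage.StorageLevel

    // DISK_ONLY persistence writes blocks under spark.local.dir on each worker.
    rdd.persist(StorageLevel.DISK_ONLY)

    // To put the data on HDFS instead, checkpoint the RDD.
    sc.setCheckpointDir("hdfs:///user/spark/rdd-checkpoints")
    rdd.checkpoint()
    rdd.count()   // the checkpoint is written when the RDD is first materialized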

RE: Shuffle to HDFS

2015-01-25 Thread Shao, Saisai
, 2015 3:03 PM To: Shao, Saisai Cc: u...@spark.incubator.apache.org Subject: Re: Shuffle to HDFS Hi,Jerry Thanks for your reply. The reason I have this question is that in Hadoop, mapper intermediate output (shuffle) will be stored in HDFS. I think the default location for spark is /tmp I think

RE: spark streaming with checkpoint

2015-01-22 Thread Shao, Saisai
for you? I think it's better and easier for you to change your implementation rather than rely on Spark to handle this. Thanks Jerry From: Balakrishnan Narendran [mailto:balu.na...@gmail.com] Sent: Friday, January 23, 2015 12:19 AM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: spark

RE: spark streaming with checkpoint

2015-01-20 Thread Shao, Saisai
Hi, Seems you have a rather large window (24 hours), so the increase in memory is expected, because a window operation caches the RDDs within that window in memory. So for your requirement, memory should be enough to hold the data of 24 hours. I don't think checkpoint in Spark

RE: dynamically change receiver for a spark stream

2015-01-20 Thread Shao, Saisai
Hi, I don't think current Spark Streaming supports this feature; all the DStream lineage is fixed after the context is started. Stopping a single stream is also not supported; currently we need to stop the whole streaming context to achieve what you want. Thanks Saisai -Original

RE: Streaming with Java: Expected ReduceByWindow to Return JavaDStream

2015-01-18 Thread Shao, Saisai
Hi Jeff, From my understanding it seems more like a bug: since JavaDStreamLike is used from Java code, returning a Scala DStream is not reasonable. You can fix this by submitting a PR, or I can help you fix it. Thanks Jerry From: Jeff Nadler [mailto:jnad...@srcginc.com] Sent: Monday, January

RE: How to replay consuming messages from kafka using spark streaming?

2015-01-14 Thread Shao, Saisai
I think there are two solutions: 1. Enable the write ahead log in Spark Streaming if you're using Spark 1.2. 2. Use a third-party Kafka consumer (https://github.com/dibbhatt/kafka-spark-consumer). Thanks Saisai -Original Message- From: mykidong [mailto:mykid...@gmail.com] Sent: Thursday,
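A hedged sketch of option 1 on Spark 1.2; the application name and checkpoint path are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-replay")
      // Turn on the receiver write ahead log (Spark 1.2+).
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    // The WAL is stored under the checkpoint directory, so a reliable
    // location such as HDFS is needed.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")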

RE: Better way of measuring custom application metrics

2015-01-04 Thread Shao, Saisai
Now I understand your requirement; maybe there are some limitations in the current MetricsSystem, and I think we can improve it. Thanks Jerry From: Enno Shioji [mailto:eshi...@gmail.com] Sent: Sunday, January 4, 2015 5:46 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: Better way

RE: Better way of measuring custom application metrics

2015-01-03 Thread Shao, Saisai
Hi, I think there's a StreamingSource in Spark Streaming which exposes the Spark Streaming running status to the metrics sinks; you can connect it with the Graphite sink to expose metrics to Graphite. I'm not sure whether this is what you want. Besides, you can customize the Source and Sink of the

RE: serialization issue with mapPartitions

2014-12-25 Thread Shao, Saisai
Hi, Hadoop Configuration is only Writable, not Java Serializable. You can use SerializableWritable (in Spark) to wrap the Configuration to make it serializable, and use a broadcast variable to broadcast this conf to all the nodes; then you can use it in mapPartitions, rather than serialize it
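A minimal sketch of this pattern, assuming Spark 1.x where SerializableWritable is public; `sc` and `rdd` are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SerializableWritable

    val hadoopConf = new Configuration()
    // Wrap the Writable-only Configuration so it becomes Java-serializable,
    // then ship it to the executors once via a broadcast variable.
    val confBroadcast = sc.broadcast(new SerializableWritable(hadoopConf))

    rdd.mapPartitions { iter =>
      val conf = confBroadcast.value.value   // unwrap back into a Configuration
      iter.map(record => (record, conf.get("fs.defaultFS")))
    }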

Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Hi, We have a requirement to save RDD output to HDFS with a saveAsTextFile-like API, but we need to overwrite the data if it already exists. I'm not sure whether current Spark supports such an operation, or whether I need to check this manually. There's a thread in the mailing list that discussed this

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Shao, Saisai
Thanks Patrick for your detailed explanation. BR Jerry -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:43 PM To: Cheng, Hao Cc: Shao, Saisai; user@spark.apache.org; d...@spark.apache.org Subject: Re: Question on saveAsTextFile

RE: Spark Streaming Python APIs?

2014-12-14 Thread Shao, Saisai
AFAIK, this will be a new feature in version 1.2; you can check out the master branch or the 1.2 branch to give it a try. Thanks Jerry From: Xiaoyong Zhu [mailto:xiaoy...@microsoft.com] Sent: Monday, December 15, 2014 10:53 AM To: user@spark.apache.org Subject: Spark Streaming Python APIs? Hi spark

RE: KafkaUtils explicit acks

2014-12-14 Thread Shao, Saisai
Hi, It is not trivial work to acknowledge the offsets when an RDD is fully processed. From my understanding, only modifying KafkaUtils is not enough to meet your requirement; you need to add metadata management for each block/RDD and track them on both the executor and driver side, and

RE: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Shao, Saisai
Hi, I don't think it's a problem of Spark Streaming; from the call stack, the problem occurs when the BlockManager is initializing itself. Would you mind checking your Spark configuration, hardware, and deployment? Mostly I think it's not a problem of Spark. Thanks Saisai From:

RE: Spark Streaming empty RDD issue

2014-12-04 Thread Shao, Saisai
Hi, According to my knowledge of the current Spark Streaming Kafka connector, there's no way for the application user to detect such a failure; this will be handled either by the Kafka consumer with the ZK coordinator or by the ReceiverTracker in Spark Streaming, so I think you don't need to take care

RE: Low Level Kafka Consumer for Spark

2014-12-02 Thread Shao, Saisai
Hi Rod, The purpose of introducing the WAL mechanism in Spark Streaming as a general solution is to let all the receivers benefit from it. Though, as you said, external sources like Kafka have their own checkpoint mechanism, so instead of storing data in the WAL, we can only store

RE: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Shao, Saisai
, November 18, 2014 2:47 AM To: Helena Edelson Cc: Jay Vyas; u...@spark.incubator.apache.org; Tobias Pfeiffer; Shao, Saisai Subject: Re: Spark streaming cannot receive any message from Kafka Hi all, I find the reason of this issue. It seems in the new version, if I do not specify

RE: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Shao, Saisai
Did you configure the Spark master as local? It should be local[n], with n > 1, for local mode. Besides, there's a Kafka wordcount example in the Spark Streaming examples; you can try that. I've tested with the latest master and it's OK. Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday,

RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
Hi, would you mind describing your problem a little more specifically? 1. Does the Kafka broker currently have no data feeding in? 2. This code will print the lines, but not on the driver side; the code runs on the executor side, so you can check the log in the worker dir to see if there's

RE: MEMORY_ONLY_SER question

2014-11-04 Thread Shao, Saisai
From my understanding, Spark uses Kryo in a streaming manner for RDD partitions, and deserialization moves forward with the iteration. But whether Kryo deserializes all the objects at once or incrementally is actually a behavior of Kryo; I guess Kryo will not deserialize

RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: Kafka Consumer in Spark Streaming The Kafka broker definitely has messages coming in. But your #2 point is valid. Needless to say I am a newbie to Spark. I can't figure out where the 'executor' logs would be. How would I find them? All I see

RE: FileNotFoundException in appcache shuffle files

2014-10-28 Thread Shao, Saisai
Hi Ryan, This is an issue with sort-based shuffle, not consolidated hash-based shuffle. I guess this issue mostly occurs when the Spark cluster is in an abnormal situation, maybe a long GC pause or something else; you can check the system status, or whether there are any other exceptions besides this one.

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
Hi Jianshi, For simulation purposes, I think you can try ConstantInputDStream and QueueInputDStream to convert one RDD or a series of RDDs into a DStream: the first one outputs the same RDD in each batch duration, and the second one outputs one RDD from a queue in each batch duration. You can take a
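A rough sketch of both options; `ssc`, `sc` and the RDD contents are placeholders:

    import scala.collection.mutable
    import org.apache.spark.streaming.dstream.ConstantInputDStream

    // Option 1: emit the same RDD in every batch interval.
    val fixedRdd = sc.parallelize(Seq("a", "b", "c"))
    val constantStream = new ConstantInputDStream(ssc, fixedRdd)

    // Option 2: emit one queued RDD per batch interval.
    val rddQueue = mutable.Queue(
      sc.parallelize(Seq(1, 2)),
      sc.parallelize(Seq(3, 4)))
    val queueStream = ssc.queueStream(rddQueue, oneAtATime = true)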

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
amount of data back to driver. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 2:39 PM To: Shao, Saisai Cc: user@spark.apache.org; Tathagata Das (t...@databricks.com) Subject: Re: RDD to DStream Hi Saisai, I understand it's non-trivial

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
cannot support nested RDD in closure. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 3:30 PM To: Shao, Saisai Cc: user@spark.apache.org; Tathagata Das (t...@databricks.com) Subject: Re: RDD to DStream Ok, back to Scala code, I'm wondering why I

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
Yes, I understand what you want, but it may be hard to achieve without collecting back to the driver node. Besides, can we just think of another way to do it? Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 4:07 PM To: Shao, Saisai Cc: user

RE: Sort-based shuffle did not work as expected

2014-10-27 Thread Shao, Saisai
Hi, Probably the problem you met is related to this JIRA ticket (https://issues.apache.org/jira/browse/SPARK-3948). It's potentially a kernel 2.6.32 bug which makes sort-based shuffle fail. I'm not sure your problem is the same as this one; would you mind checking your kernel version?

RE: Spark Hive Snappy Error

2014-10-22 Thread Shao, Saisai
Thanks a lot, I will try to reproduce this in my local settings and dig into the details, thanks for your information. BR Jerry From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] Sent: Wednesday, October 22, 2014 8:35 PM To: Shao, Saisai Cc: arthur.hk.c...@gmail.com; user Subject

RE: Spark Hive Snappy Error

2014-10-22 Thread Shao, Saisai
...@gmail.com [mailto:arthur.hk.c...@gmail.com] Sent: Thursday, October 23, 2014 11:32 AM To: Shao, Saisai Cc: arthur.hk.c...@gmail.com; user Subject: Re: Spark Hive Snappy Error Hi, Please find the attached file. my spark-default.xml # Default system properties included when running spark-submit

RE: Shuffle files

2014-10-20 Thread Shao, Saisai
Hi Song, From what I know about sort-based shuffle, the number of files opened in parallel is normally much smaller than in hash-based shuffle. In hash-based shuffle, the number of files opened in parallel is C * R (where C is the number of cores used and R is the number of reducers), as you can see the file

RE: Spark Hive Snappy Error

2014-10-16 Thread Shao, Saisai
Hi Arthur, I think this is a known issue in Spark; you can check (https://issues.apache.org/jira/browse/SPARK-3958). I'm curious about it: can you always reproduce this issue? Is it related to some specific data sets? Would you mind giving me some information about your workload, Spark

RE: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Shao, Saisai
Hi Abraham, You are correct, the "auto.offset.reset" behavior in KafkaReceiver is different from Kafka's original semantics: if you set this configuration, KafkaReceiver will immediately clean the related offsets, but for Kafka this configuration is just a hint which takes effect only when the offset is

RE: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Shao, Saisai
to keep the same semantics as Kafka, you need to remove the above code path manually and recompile Spark. Thanks Jerry From: Abraham Jacob [mailto:abe.jac...@gmail.com] Sent: Saturday, October 11, 2014 8:49 AM To: Shao, Saisai Cc: user@spark.apache.org; Sean McNamara Subject: Re: Spark

RE: Error reading from Kafka

2014-10-08 Thread Shao, Saisai
Hi, I think you have to change the code to specify the type info, like this: val kafkaStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicMap,

RE: problem with data locality api

2014-09-28 Thread Shao, Saisai
Hi, The first conf is used by Hadoop to determine the locality distribution of the HDFS file. The second conf is used by Spark; though they have the same name, they are actually two different classes. Thanks Jerry From: qinwei [mailto:wei@dewmobile.net] Sent: Sunday, September 28, 2014 2:05 PM To: user

RE: sortByKey trouble

2014-09-24 Thread Shao, Saisai
Hi, sortByKey works only on RDD[(K, V)]; each tuple can have only two members, and Spark will sort by the first member. If you want to use sortByKey, you have to change your RDD[(String, String, String, String)] into RDD[(String, (String, String, String))]. Thanks Jerry -Original Message-
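A small sketch of the reshaping described above, assuming a SparkContext `sc` (the tuple contents are invented):

    import org.apache.spark.SparkContext._

    val rows = sc.parallelize(Seq(
      ("beta", "2", "x", "p"),
      ("alpha", "1", "y", "q")))

    // Fold the last three fields into the value so sortByKey sees a (K, V) pair
    // and sorts by the first field.
    val sorted = rows
      .map { case (k, f1, f2, f3) => (k, (f1, f2, f3)) }
      .sortByKey()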

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
Hi, spark.local.dir is the directory used to write map output data and persisted RDD blocks, but the file paths are hashed, so you cannot directly find the persisted RDD block files; they will definitely be in these folders on your worker nodes, though. Thanks Jerry From: Priya Ch

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Shao, Saisai
blocks would be persisted to hdfs then will i be able to read the hdfs blocks as i could do in hadoop ? On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List] [hidden email]/user/SendEmail.jtp?type=nodenode=14887i=1 wrote: Hi, Spark.local.dir is the one used to write map output

RE: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Shao, Saisai
If you have enough memory, the build will be faster, within one minute, since most of the files are cached. You can also build your Spark project on a mounted ramfs in Linux, which will also speed up the process. Thanks Jerry -Original Message- From: Zhan Zhang

RE: Issues with partitionBy: FetchFailed

2014-09-21 Thread Shao, Saisai
Hi, I've also met this problem before. I think you can try to set "spark.core.connection.ack.wait.timeout" to a larger value to avoid ack timeouts; the default is 60 seconds. Sometimes, because of GC pauses or other reasons, the acknowledgement message times out, which will lead to this
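For illustration (the value here is arbitrary), the setting can be applied on the SparkConf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Default is 60 seconds; raise it so long GC pauses do not turn into
      // missed acks and FetchFailed errors.
      .set("spark.core.connection.ack.wait.timeout", "300")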

RE: Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-18 Thread Shao, Saisai
Hi Rafeeq, I think this situation always occurs when your Spark Streaming application is running in an abnormal state. Would you mind checking your job processing time in the WebUI or log: is the total latency of job processing + job scheduling larger than the batch duration? If your Spark

RE: JMXSink for YARN deployment

2014-09-11 Thread Shao, Saisai
Hi, I'm guessing the problem is that the driver or executors cannot get the metrics.properties configuration file inside the YARN container, so the metrics system cannot load the right sinks. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014

RE: JMXSink for YARN deployment

2014-09-11 Thread Shao, Saisai
put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.commailto:saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
Hi, Is there any specific scenario which needs to know the number of RDDs in the DStream? To my knowledge a DStream will generate one RDD in each batch duration; some old RDDs will be remembered for windowing-like functions and will be removed when they become useless. The hashmap generatedRDDs

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
Hi, I think each received stream will generate an RDD in each batch duration even when there is no data fed in (an empty RDD will be generated). So you cannot use the number of RDDs to judge whether any data was received. One way is to do this in DStream/foreachRDD(), like a.foreachRDD { r => if
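A hedged sketch of the check (Spark of that era had no RDD.isEmpty, so take(1) is used; `stream` is a placeholder):

    stream.foreachRDD { rdd =>
      // An empty RDD is generated for a batch with no input, so inspect the
      // contents rather than relying on the number of RDDs.
      if (rdd.take(1).nonEmpty) {
        println(s"batch contains ${rdd.count()} records")
      }
    }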

RE: Spark streaming: size of DStream

2014-09-09 Thread Shao, Saisai
I think you should clarify some things in Spark Streaming: 1. the closure in map runs on the remote side, so modifying the count var will only take effect on the remote side; you will always get -1 on the driver side. 2. some code in the foreachRDD closure is lazily executed in each batch duration, while

RE: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Shao, Saisai
Hi Hemanth, I think there is a bug in this API in Spark 0.8.1, so you will meet this exception when using Java code with this API; the bug is fixed in the latest version, as you can see in the patch (https://github.com/apache/spark/pull/1508). But it's only for Kafka 0.8+, as you still use kafka

RE: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Shao, Saisai
, September 09, 2014 1:19 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: Setting Kafka parameters in Spark Streaming Thanks, Shao, for providing the necessary information. Hemanth On Tue, Sep 9, 2014 at 8:21 AM, Shao, Saisai saisai.s...@intel.commailto:saisai.s...@intel.com wrote: Hi

RE: Trying to run SparkSQL over Spark Streaming

2014-08-21 Thread Shao, Saisai
Hi, StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC project based on Spark that combines the power of Catalyst and Spark Streaming, to offer people the ability to manipulate SQL on top of DStreams as you wanted; it keeps the same semantics as SparkSQL and offers a

RE: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread Shao, Saisai
Hi, I don't think there's an NPE issue when using DStream/count() even when there is no data fed into Spark Streaming. I tested using Kafka in my local settings, and both cases are OK, with and without data consumed. Actually you can see the details in ReceiverInputDStream: even when there is no data in this

RE: OutOfMemory Error

2014-08-20 Thread Shao, Saisai
Hi Meethu, spark.executor.memory is the Java heap size of the forked executor process; increasing spark.executor.memory actually increases the runtime heap size of the executor process. For the details of Spark configurations, you can check:
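For example (the application name and the 4g value are arbitrary), the setting goes on the SparkConf or the submit command line:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      // Heap size of each forked executor JVM.
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)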

RE: Hi

2014-08-20 Thread Shao, Saisai
Hi, Actually several Java task threads run in a single executor, not processes, so each executor has only one JVM runtime, which is shared by the different task threads. Thanks Jerry From: rapelly kartheek [mailto:kartheek.m...@gmail.com] Sent: Wednesday, August 20, 2014 5:29 PM To:

RE: Data loss - Spark streaming and network receiver

2014-08-18 Thread Shao, Saisai
I think currently Spark Streaming lacks a data acknowledgement mechanism for when data is stored and replicated in the BlockManager, so potentially data can be lost even after it has been pulled from Kafka: say the data is stored only in the BlockGenerator, not in the BM, while in the meantime Kafka itself has committed the consumer offset,

RE: spark streaming - lamda architecture

2014-08-14 Thread Shao, Saisai
Hi Ali, Maybe you can take a look at Twitter's Summingbird project (https://github.com/twitter/summingbird), which is currently one of the few open source choices for a lambda architecture. There's an ongoing sub-project called summingbird-spark that might be the one you want; maybe this can

RE: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-05 Thread Shao, Saisai
Hi Rafeeq, I think the current Spark Streaming API can offer you the ability to fetch data from Kafka and store it to another external store. If you do not care about managing the consumer offsets manually, there's no need to use a low-level API like SimpleConsumer. For Kafka 0.8.1 compatibility, you

RE: spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Shao, Saisai
Hi Haopu, Please see the inline comments. Thanks Jerry -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Wednesday, July 23, 2014 3:00 PM To: user@spark.apache.org Subject: spark.streaming.unpersist and spark.cleaner.ttl I have a DStream receiving data from a

RE: spark.streaming.unpersist and spark.cleaner.ttl

2014-07-23 Thread Shao, Saisai
will take a look at DStream.scala although I have no Scala experience. -Original Message- From: Shao, Saisai [mailto:saisai.s...@intel.com] Sent: July 23, 2014 15:13 To: user@spark.apache.org Subject: RE: spark.streaming.unpersist and spark.cleaner.ttl Hi Haopu, Please see the inline

RE: Executor metrics in spark application

2014-07-22 Thread Shao, Saisai
Hi Denes, I think you can register your customized metrics source into the metrics system through metrics.properties; you can take metrics.properties.template as a reference. Basically you can do as follows if you want to monitor on the executors:

RE: number of Cached Partitions v.s. Total Partitions

2014-07-22 Thread Shao, Saisai
Yes, it's normal when there is not enough memory to hold the third partition, as you can see in your attached picture. Thanks Jerry From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, July 22, 2014 3:09 PM To: user@spark.apache.org Subject: number of Cached Partitions v.s. Total Partitions

RE: Executor metrics in spark application

2014-07-22 Thread Shao, Saisai
Yeah, now I see your purpose. The original design of customized metrics sources focused on self-contained sources; since you need to rely on an outer variable, the way you mentioned may be the only way to register it. Besides, as you cannot see the source in Ganglia, I think you can

RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
Actually we have a POC project which shows the power of combining Spark Streaming and Catalyst: it can manipulate SQL on top of Spark Streaming and get a SchemaDStream. You can take a look at it: https://github.com/thunderain-project/StreamSQL Thanks Jerry From: Tathagata Das

RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
...@preferred.jp] Sent: Friday, July 11, 2014 10:47 AM To: user@spark.apache.org Subject: Re: Some question about SQL and streaming Hi, On Fri, Jul 11, 2014 at 11:38 AM, Shao, Saisai saisai.s...@intel.commailto:saisai.s...@intel.com wrote: Actually we have a POC project which shows the power of combining
