Re: how to choose right DStream batch interval

2014-09-05 Thread qihong
... factor or failures), then what's the right batch interval? 5 seconds (the worst case)? 2. What will happen to DStream processing if one batch takes longer than the batch interval? Can Spark recover from that? Thanks, Qihong

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread qihong
The command should be "spark-shell --master spark://<master-host>:7077".

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread qihong
Since you are using your home computer, it's probably not reachable by EC2 over the internet. You can try to set "spark.driver.host" to your WAN IP and "spark.driver.port" to a fixed port in SparkConf, and open that port in your home network (port forwarding to the computer you are using). See if that helps.
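A minimal sketch of that configuration, assuming a Spark 1.x Scala app; the master address, WAN IP, and port number below are placeholders, not values from the original post:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("remote-driver")
    .setMaster("spark://ec2-master-host:7077") // hypothetical EC2 master
    .set("spark.driver.host", "203.0.113.10")  // example WAN IP of the home network
    .set("spark.driver.port", "51000")         // fixed port, forwarded to this machine

  val sc = new SparkContext(conf)

With a fixed driver port, a single port-forwarding rule is enough; by default the driver port is chosen randomly, which is why forwarding one port wouldn't otherwise work.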

how to setup steady state stream partitions

2014-09-09 Thread qihong
I'm working on a DStream application. The input is sensor measurements; the data format is <...>. There are 10 thousand sensors, and updateStateByKey is used to maintain the states of the sensors. The code looks like the following:

  val inputDStream = ...
  val keyedDStream = inputDStream.map(...) // use sensorId as key

Re: how to setup steady state stream partitions

2014-09-09 Thread qihong
Thanks for your response. I do have something like:

  val inputDStream = ...
  val keyedDStream = inputDStream.map(...) // use sensorId as key
  val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(new MyPartitioner(...)))
  val stateDStream = partitionedDStream.updateStateByKey[...](updateFunc)

Re: how to choose right DStream batch interval

2014-09-09 Thread qihong
Hi Mayur, thanks for your response. I did write a simple test that sets up a DStream with 5 batches; the batch duration is 1 second, and the 3rd batch takes an extra 2 seconds. The output of the test shows that the 3rd batch causes a backlog, and Spark Streaming does catch up on the 4th and 5th batches ...
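A rough reconstruction of that kind of test, assuming a local Spark 1.x setup; this is not the poster's actual code, and the queue-based input and the sleep are illustrative:

  import scala.collection.mutable
  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("backlog-test").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

  // Five pre-built "batches", fed one per interval.
  val queue = mutable.Queue[RDD[Int]]()
  (1 to 5).foreach(i => queue += ssc.sparkContext.parallelize(Seq(i)))
  val stream = ssc.queueStream(queue, oneAtATime = true)

  stream.foreachRDD { rdd =>
    val batch = rdd.collect()
    if (batch.contains(3)) Thread.sleep(2000) // 3rd batch takes ~2s longer than the interval
    println(s"finished batch ${batch.mkString(",")} at ${System.currentTimeMillis()}")
  }

  ssc.start()
  ssc.awaitTermination()

The timestamps printed for batches 4 and 5 show whether the scheduler falls behind after batch 3 and then catches up.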

Re: How to set java.library.path in a spark cluster

2014-09-09 Thread qihong
Add something like the following to spark-env.sh:

  export LD_LIBRARY_PATH=<path to your native libraries>:$LD_LIBRARY_PATH

(and remove all 5 exports you listed). Then restart all worker nodes, and try again. Good luck!

RE: how to setup steady state stream partitions

2014-09-10 Thread qihong
Thanks for your response! I found that too, and it does the trick! Here's the refined code:

  val inputDStream = ...
  val keyedDStream = inputDStream.map(...) // use sensorId as key
  val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(new MyPartitioner(...)))
  val stateDStream = partitionedDStream.updateStateByKey[...](updateFunc)
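A fuller, self-contained sketch of this pattern; MyPartitioner's hashing, the state type, and the socket source are illustrative guesses, not the poster's actual logic:

  import org.apache.spark.{Partitioner, SparkConf}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Hypothetical stand-in for MyPartitioner: route each sensorId to a fixed partition.
  class MyPartitioner(partitions: Int) extends Partitioner {
    override def numPartitions: Int = partitions
    override def getPartition(key: Any): Int = math.abs(key.hashCode) % partitions
  }

  val conf = new SparkConf().setAppName("steady-partitions")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/steady-partitions") // updateStateByKey requires checkpointing

  val inputDStream = ssc.socketTextStream("sensor-gateway", 9999) // placeholder source
  val keyedDStream = inputDStream.map { line =>
    val fields = line.split(",")
    (fields(0), fields(1).toDouble) // (sensorId, measurement)
  }

  val partitioner = new MyPartitioner(16)
  val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(partitioner))

  // Passing the same partitioner to updateStateByKey keeps the state RDDs
  // co-partitioned with the input, so each batch avoids a full shuffle.
  val updateFunc = (values: Seq[Double], state: Option[Double]) =>
    Some(values.sum + state.getOrElse(0.0))
  val stateDStream = partitionedDStream.updateStateByKey(updateFunc, partitioner)

  stateDStream.print()
  ssc.start()
  ssc.awaitTermination()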

Re: compiling spark source code

2014-09-12 Thread qihong
Follow the instructions here: http://spark.apache.org/docs/latest/building-with-maven.html

Re: How to initialize StateDStream

2014-09-12 Thread qihong
There's no need to initialize a StateDStream. Take a look at the example StatefulNetworkWordCount.scala; it's part of the Spark source code.
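For reference, a minimal stateful count in the spirit of that example (the checkpoint path and socket source are illustrative): unseen keys simply start with state None, so no explicit initialization is needed.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("stateful-wordcount")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/stateful-wordcount") // required by updateStateByKey

  val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
  val counts = words.map(w => (w, 1)).updateStateByKey[Int] {
    (newValues: Seq[Int], running: Option[Int]) =>
      Some(newValues.sum + running.getOrElse(0)) // None for a new key => starts from 0
  }
  counts.print()

  ssc.start()
  ssc.awaitTermination()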

Re: How to initialize StateDStream

2014-09-13 Thread qihong
I'm not sure what you mean by "previous run". Is it the previous batch, or a previous run of spark-submit? If it's the previous batch (Spark Streaming creates a batch every batch interval), then there's nothing to do. If it's a previous run of spark-submit (assuming you are able to save the result somewhere ...
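One possible way to handle the spark-submit case, sketched under the assumption that the earlier run saved its final (key, state) pairs with saveAsObjectFile; the paths, types, and socket source here are hypothetical:

  import scala.collection.mutable
  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("restore-state")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/restore-state")

  // Load the saved states and replay them as a one-off first "batch".
  val saved: RDD[(String, Int)] =
    ssc.sparkContext.objectFile[(String, Int)]("hdfs:///app/state/last-run")
  val seedStream = ssc.queueStream(mutable.Queue(saved), oneAtATime = true)

  val liveStream = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

  // The union makes updateStateByKey absorb the saved counts on the first batch.
  val state = liveStream.union(seedStream).updateStateByKey[Int] {
    (values: Seq[Int], prev: Option[Int]) => Some(values.sum + prev.getOrElse(0))
  }
  state.print()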

RE: Change RDDs using map()

2014-09-17 Thread qihong
If you want the result as an RDD of (key, 1):

  val new_rdd = rdd.filter(x => x._2 == 1)

If you want the result as an RDD of keys (since you know the values are 1), then:

  val new_rdd = rdd.filter(x => x._2 == 1).map(x => x._1)

x._1 and x._2 are Scala's way to access the key and value of a key/value pair.
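A runnable toy version of those two lines, with made-up data, for anyone who wants to try it in spark-shell (where sc is predefined):

  val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 1)))

  val pairsWithOne = rdd.filter(x => x._2 == 1)                // ("a",1), ("c",1)
  val keysWithOne  = rdd.filter(x => x._2 == 1).map(x => x._1) // "a", "c"

  keysWithOne.collect().foreach(println)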

Re: Dealing with Time Series Data

2014-09-17 Thread qihong
What are you trying to do? Generate time series from your data in HDFS, or do some transformation and/or aggregation on your time series data in HDFS?

Re: Spark Streaming - batchDuration for streaming

2014-09-17 Thread qihong
Here's the official Spark documentation about batch size/interval: http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-size Spark is batch-oriented processing. As you mentioned, a stream is a continuous flow, and core Spark cannot handle it. Spark Streaming bridges the gap by cutting the continuous flow into small batches, one per batch interval, and processing each batch as an ordinary Spark job ...
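A minimal sketch of that micro-batch model; the 5-second interval, host, and port are illustrative, not values from the thread:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("batch-interval-demo")
  // Every 5 seconds, the data received so far is cut into one batch (an RDD).
  val ssc = new StreamingContext(conf, Seconds(5))

  val lines = ssc.socketTextStream("localhost", 9999)
  lines.count().print() // each batch is processed as an ordinary Spark job

  ssc.start()
  ssc.awaitTermination()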

Is it possible to invoke updateStateByKey twice on the same RDD

2014-10-09 Thread qihong
I need to implement the following logic in a Spark Streaming app: for the incoming dStream, do some transformation, and invoke updateStateByKey to update the state object for each key (mark data entries that are updated as dirty for the next step), then let the state objects produce event(s) based on the dirty flags ...
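A sketch of the pipeline the question describes, not a confirmed answer to it; the SensorState class, dirty flag, and event logic are all hypothetical, and the second phase is expressed here as a filter over the state stream rather than a second updateStateByKey:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  case class SensorState(value: Double, dirty: Boolean)

  val conf = new SparkConf().setAppName("two-phase-state")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/two-phase-state")

  val input = ssc.socketTextStream("localhost", 9999).map { line =>
    val f = line.split(",")
    (f(0), f(1).toDouble) // (sensorId, measurement)
  }

  // Phase 1: update each key's state, marking it dirty when new data arrived.
  val stateStream = input.updateStateByKey[SensorState] {
    (values: Seq[Double], prev: Option[SensorState]) =>
      if (values.isEmpty) prev.map(_.copy(dirty = false)) // no new data: clear the flag
      else Some(SensorState(values.max, dirty = true))
  }

  // Phase 2: let dirty state objects produce events downstream.
  val events = stateStream.filter(_._2.dirty).map { case (id, s) =>
    s"event: sensor $id changed to ${s.value}"
  }
  events.print()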