PySpark read from HBase

2016-08-12 Thread Bin Wang
Hi there, I have lots of raw data in several Hive tables, and we built a workflow to "join" those records together and restructure them into HBase. It was done using plain MapReduce to generate HFiles and then incrementally load the HFiles into HBase to guarantee the best performance. However, we
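
For reference, a minimal Scala sketch of reading an HBase table from Spark via TableInputFormat (the thread itself asks about PySpark, but the underlying input format is the same); the table name "my_table" is a placeholder:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("HBaseReadSketch"))

    // Point TableInputFormat at the (placeholder) table to scan.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each element is a (row key, Result) pair produced by the HBase input format.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"Row count: ${rows.count()}")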

How to close connection in mapPartitions?

2015-10-23 Thread Bin Wang
I use mapPartitions to open connections to Redis, and I write it like this: val seqs = lines.mapPartitions { lines => val cache = new RedisCache(redisUrl, redisPort) val result = lines.map(line => Parser.parseBody(line, cache)) cache.redisPool.close result } But it
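
A minimal sketch of one way around this, assuming the RedisCache, Parser, redisUrl, and redisPort from the original post: the original closes the pool before the lazy iterator is consumed, so here the partition is materialized while the connection is still open:

    val seqs = lines.mapPartitions { iter =>
      val cache = new RedisCache(redisUrl, redisPort)
      try {
        // Force evaluation while the connection is alive, then hand back an iterator.
        iter.map(line => Parser.parseBody(line, cache)).toList.iterator
      } finally {
        cache.redisPool.close()
      }
    }

Materializing the whole partition trades memory for correctness; for large partitions an alternative is to wrap the iterator so the pool is closed only once it is exhausted.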

Re: How to close connection in mapPartitions?

2015-10-23 Thread Bin Wang
BTW, "lines" is a DStream. Bin Wang <wbi...@gmail.com>于2015年10月23日周五 下午2:16写道: > I use mapPartitions to open connections to Redis, I write it like this: > > val seqs = lines.mapPartitions { lines => > val cache = new RedisCache(redisUrl, redisPort) >

Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
I'd like the Spark application to stop gracefully when it receives a kill signal, so I added this code: sys.ShutdownHookThread { println("Gracefully stopping Spark Streaming Application") ssc.stop(stopSparkContext = true, stopGracefully = true) println("Application
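
A minimal sketch of an alternative, assuming Spark 1.4+: instead of a hand-written shutdown hook, let Spark stop the StreamingContext gracefully on shutdown via configuration (the app name and batch interval below are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Spark registers its own shutdown hook; this flag makes it stop gracefully.
    val conf = new SparkConf()
      .setAppName("GracefulShutdownSketch")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(300))
    // ... define the DStream pipeline here ...
    ssc.start()
    ssc.awaitTermination()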

Is it possible to merge delayed batches in streaming?

2015-09-23 Thread Bin Wang
I'm using Spark Streaming and there may be some delays between batches. I'd like to know whether it is possible to merge delayed batches into one batch for processing. For example, the interval is set to 5 min but the first batch takes 1 hour, so many batches are delayed. At the end of processing

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
07.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E > > Thanks > Best Regards > > On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbi...@gmail.com> wrote: > >> And here is another question. If I load the DStream from database every >> time I start the job, will the dat

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
the values preloaded from DB > 2. By cleaning the checkpoint in between upgrades, data is loaded > only once > > Hope this helps, > -adrian > > From: Bin Wang > Date: Thursday, September 17, 2015 at 11:27 AM > To: Akhil Das > Cc: user > Subject: Re: How t

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
torage (like a db or > zookeeper etc) to keep the state (the indexes etc) and then when you deploy > new code they can be easily recovered. > > Thanks > Best Regards > > On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbi...@gmail.com> wrote: > >> I'd like to know if th

How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
I'd like to know if there is a way to recover a DStream from a checkpoint. Because I store state in the DStream, I'd like the state to be recovered when I restart the application and deploy new code.
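
A minimal sketch of checkpoint-based recovery with StreamingContext.getOrCreate; the checkpoint path and batch interval are placeholders, and all DStream setup must happen inside the factory function or the restored context will have no lineage to recover:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("CheckpointRecoverySketch")
      val ssc = new StreamingContext(conf, Seconds(300))
      ssc.checkpoint(checkpointDir)
      // ... build DStreams and stateful operations (e.g. updateStateByKey) here ...
      ssc
    }

    // Load the saved context if a checkpoint exists, otherwise build a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

Note the caveat raised later in the thread: a checkpoint generally cannot be restored after deploying changed code, so long-lived state often needs to live in external storage as well.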

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
And here is another question. If I load the DStream from the database every time I start the job, will the data be loaded when the job fails and automatically restarts? If so, both the checkpoint data and the database data are loaded; won't this be a problem? Bin Wang <wbi...@gmail.com> wrote on Wed, Sep 16, 2015 at

How to clear Kafka offset in Spark streaming?

2015-09-14 Thread Bin Wang
Hi, I'm using Spark Streaming with Kafka and I need to clear the offsets and recompute everything. I deleted the checkpoint directory in HDFS and reset the Kafka offsets with "kafka-run-class kafka.tools.ImportZkOffsets". I can confirm the offset is set to 0 in Kafka: ~ > kafka-run-class
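
The thread uses the receiver-based Kafka consumer with offsets kept in ZooKeeper; as a point of comparison, a minimal sketch assuming the direct (receiver-less) Kafka API of Spark 1.x, where deleting the checkpoint and setting "auto.offset.reset" to "smallest" restarts consumption from the beginning of the topic (broker list and topic name are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaOffsetResetSketch")
    val ssc = new StreamingContext(conf, Seconds(300))

    // With no checkpoint and no stored offsets, "smallest" starts from the earliest message.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset" -> "smallest")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my_topic"))

    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()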

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
receiver for stream 0: Stopped by driver Tathagata Das <t...@databricks.com> wrote on Sun, Sep 13, 2015 at 4:05 PM: > Maybe the driver got restarted. See the log4j logs of the driver before it > restarted. > > On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang <wbi...@gmail.com> wrote: > &

Data lost in spark streaming

2015-09-11 Thread Bin Wang
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data it has received. But today the historical data in the DStream seems to have been lost suddenly, and the application UI also lost the streaming processing times and all the related data. Could anyone give some hints on debugging this? Thanks.

Re: Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
16, 2015 1:33 PM, Bin Wang wbi...@gmail.com wrote: If I write code like this: val rdd = input.map(_.value) val f1 = rdd.filter(_ == 1) val f2 = rdd.filter(_ == 2) ... Then the DAG of the execution may be this: - Filter - ... Map - Filter - ... But the two filters

Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
If I write code like this: val rdd = input.map(_.value) val f1 = rdd.filter(_ == 1) val f2 = rdd.filter(_ == 2) ... Then the DAG of the execution may be this: - Filter - ... Map - Filter - ... But the two filters operate on the same RDD, which means it could be done by
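
For reference, a minimal sketch assuming "input" is an RDD whose elements expose a numeric "value" field as in the snippet above. Spark does not merge the two filter lineages into one pass; persisting the shared parent avoids recomputing the map, and a single aggregate() can evaluate both predicates in one traversal:

    // Persist the shared parent so the map is not recomputed for each filter's job.
    val rdd = input.map(_.value).persist()
    val f1 = rdd.filter(_ == 1)
    val f2 = rdd.filter(_ == 2)

    // Alternative: count both conditions in a single pass over the data.
    val (ones, twos) = rdd.aggregate((0L, 0L))(
      (acc, v) => (acc._1 + (if (v == 1) 1L else 0L), acc._2 + (if (v == 2) 1L else 0L)),
      (a, b) => (a._1 + b._1, a._2 + b._2))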

Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
I'm using Spark Streaming with Kafka and submit it to a YARN cluster in yarn-cluster mode. But it hangs at SparkContext.start(). The Kafka config is right since it shows some events in the Streaming tab of the web UI. The attached file is a screenshot of the Jobs tab of the web UI. The code in the

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
Thanks for the help. I set --executor-cores and it works now. I had been using --total-executor-cores and didn't realize it had changed. Tathagata Das <t...@databricks.com> wrote on Fri, Jul 10, 2015 at 3:11 AM: 1. There will be a long-running job with description start(), as that is the job that runs the receivers.

Re: How to submit streaming application and exit

2015-07-08 Thread Bin Wang
of the streaming app. On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wbi...@gmail.com wrote: I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it in a client node and exit spark-submit after the application is running. Is it possible

How to submit streaming application and exit

2015-07-07 Thread Bin Wang
I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it from a client node and have spark-submit exit after the application is running. Is that possible?

Problem Running Spark Example HBase Code Using Spark-Submit

2015-06-25 Thread Bin Wang
I am trying to run the Spark example code HBaseTest from the command line using spark-submit instead of run-example, so that I can learn more about how to run Spark code in general. However, it reported CLASS_NOT_FOUND for htrace since I am using CDH 5.4. I successfully located the htrace jar file but I

Re: Specify Python interpreter

2015-05-12 Thread Bin Wang
development. Please share. Bin On Tue, May 12, 2015 at 1:20 AM, Tomas Olsson tomas.ols...@mdh.se wrote: Hi, You can try PYSPARK_DRIVER_PYTHON=/path/to/ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /path/to/pyspark /Tomas On 11 May 2015, at 22:17, Bin Wang binwang...@gmail.com wrote: Hey

Specify Python interpreter

2015-05-11 Thread Bin Wang
Hey there, I have installed a Python interpreter in a certain location, say /opt/local/anaconda. Is there any way I can specify the Python interpreter while developing in an iPython notebook? Maybe a property set while creating the SparkContext? I know that I can put #!/opt/local/anaconda

Spark on top of YARN Compression in iPython notebook

2015-05-10 Thread Bin Wang
I started an AWS cluster (1 master + 3 core nodes) and downloaded the prebuilt Spark binary. I downloaded the latest Anaconda Python and started an iPython notebook server by running the command below: ipython notebook --port --profile nbserver --no-browser Then, I tried to develop a Spark

Anaconda iPython notebook working with CDH Spark

2014-12-28 Thread Bin Wang
Hi there, I have a cluster with CDH 5.1 running on top of Red Hat 6.5, where the default Python version is 2.6. I am trying to set up a proper iPython notebook environment to develop Spark applications using PySpark. Here

Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio, I happen to be attending the 2014 Apache Conf, where I heard about a project called Apache Phoenix, which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory-bounded, whereas memory sets a limit for Spark. It is still in the incubating group and

Re: Missing Spark URL after starting the master

2014-03-04 Thread Bin Wang
... Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang binwang...@gmail.com wrote: Hi Ognen/Mayur, Thanks for the reply and it is good to know how easy it is to setup Spark on AWS

Spark Streaming Maven Build

2014-03-04 Thread Bin Wang
Hi there, I tried the Kafka WordCount example and it works perfectly, and the code is pretty straightforward to understand. Can anyone show me how to start my own Maven project with the KafkaWordCount example with minimum effort? 1. What should the pom file look like (including the jar-plugin?