Anaconda iPython notebook working with CDH Spark

2014-12-28 Thread Bin Wang
Hi there, I have a cluster with CDH5.1 running on top of RedHat 6.5, where the default Python version is 2.6. I am trying to set up a proper iPython notebook environment to develop Spark applications using PySpark. Here

Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
Hi there, I have a CDH cluster set up, and I tried using the Spark parcel that comes with Cloudera Manager, but it turns out it doesn't even have the run-example shell script in the bin folder. Then I removed it from the cluster and cloned incubator-spark onto the name node of my cluster, and buil

Re: Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
oups get really old quickly. Added bonus > is that you can use a VPN as an entry into the whole system and your > cluster instantly becomes "local" to you in terms of IPs etc. I use OpenVPN > since I don't like Cisco nor Juniper (the only two options Amazon provides > for t

Re: Missing Spark URL after starting the master

2014-03-04 Thread Bin Wang
t; > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang wrote: > >> Hi Ognen/Mayur, >> >> Thanks for the reply and it

Spark Streaming Maven Build

2014-03-04 Thread Bin Wang
Hi there, I tried the Kafka WordCount example; it works perfectly and the code is pretty straightforward to understand. Can anyone show me how to start my own Maven project based on the KafkaWordCount example with minimum effort? 1. What the pom file should look like (including jar-plugin? ass

Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio, I happen to be attending the 2014 Apache Conf, where I heard about a project called "Apache Phoenix", which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory-bound, something that does set a limit for Spark. It is still in the incubating group and t

Re: Spark and HBase

2014-04-08 Thread Bin Wang
od Phoenix is good if I > have to query a simple column of HBase but things get really complicated if > I have to add an index for each column in my table and I store complex > object within the cells. Is it correct? > > Best, > Flavio > > > > > On Tue, Apr 8, 201

Re: Scala vs Python performance differences

2014-04-14 Thread Bin Wang
At least, Spark Streaming doesn't support Python at this moment, right? On Mon, Apr 14, 2014 at 6:48 PM, Andrew Ash wrote: > Hi Spark users, > > I've always done all my Spark work in Scala, but occasionally people ask > about Python and its performance impact vs the same algorithm > implementat

Using Pandas/Scikit Learning in Pyspark

2015-05-08 Thread Bin Wang
Hey there, I have a CDH cluster where the default Python installed on those RedHat Linux machines is Python 2.6. I am thinking about developing a Spark application using PySpark and I want to be able to use the Pandas and scikit-learn packages. The Anaconda Python interpreter has the most functionality out of the box
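A common way to do this (a sketch, assuming Anaconda is installed at the same path on every node, for example /opt/local/anaconda) is to point PySpark at the Anaconda interpreter through environment variables before launching:
export PYSPARK_PYTHON=/opt/local/anaconda/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/local/anaconda/bin/python
With both variables set, the driver and the executors use the Anaconda interpreter, and its bundled Pandas and scikit-learn packages, instead of the system Python 2.6.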

Spark on top of YARN Compression in iPython notebook

2015-05-10 Thread Bin Wang
I started an AWS cluster (1 master + 3 core) and downloaded the prebuilt Spark binary. I downloaded the latest Anaconda Python and started an iPython notebook server by running the command below: ipython notebook --port --profile nbserver --no-browser Then, I try to develop a Spark application

Specify Python interpreter

2015-05-11 Thread Bin Wang
Hey there, I have installed a Python interpreter in a certain location, say "/opt/local/anaconda". Is there any way to specify the Python interpreter while developing in an iPython notebook? Maybe a property to set while creating the SparkContext? I know that I can put "#!/opt/local/anacond

Re: Specify Python interpreter

2015-05-12 Thread Bin Wang
ook to interactively write Spark applications. If you have a better Python solution that can offer a better workflow for interactive Spark development, please share. Bin On Tue, May 12, 2015 at 1:20 AM, Tomas Olsson wrote: > Hi, > You can try > > PYSPARK_DRIVER_PYTHON=/path/to/ipython

Problem Running Spark Example HBase Code Using Spark-Submit

2015-06-25 Thread Bin Wang
I am trying to run the Spark example code HBaseTest from the command line using spark-submit instead of run-example, so that I can learn more about how to run Spark code in general. However, it reports CLASS_NOT_FOUND for htrace since I am using CDH5.4. I successfully located the htrace jar file but I
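One fix that often works here (a sketch; HBaseTest is the example's real entry point, but the jar locations below are assumptions for a CDH 5.4 parcel layout) is to ship the htrace jar explicitly with --jars:
spark-submit --class org.apache.spark.examples.HBaseTest \
  --master yarn-client \
  --jars /opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar \
  /path/to/spark-examples.jar my_table
Adding the same jar to spark.driver.extraClassPath and spark.executor.extraClassPath is an alternative when --jars is not convenient.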

How to submit a streaming application and exit

2015-07-07 Thread Bin Wang
I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it in a client node and exit spark-submit after the application is running. Is it possible?
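One option (a sketch, assuming Spark 1.4+ on YARN; the class and jar names are placeholders) is to submit in yarn-cluster mode and tell spark-submit not to wait for the application to finish:
spark-submit --master yarn-cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --class com.example.MyStreamingApp /path/to/streaming-app.jar
With that setting the driver runs inside the YARN application master, and the spark-submit process on the client node returns once the application has been accepted by YARN.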

Re: How to submit a streaming application and exit

2015-07-08 Thread Bin Wang
for > the lifetime of the streaming app. > > On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wrote: > >> I'm writing a streaming application and want to use spark-submit to >> submit it to a YARN cluster. I'd like to submit it in a client node and >> exit spar

Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
I'm using spark streaming with Kafka, and submit it to YARN cluster with mode "yarn-cluster". But it hangs at SparkContext.start(). The Kafka config is right since it can show some events in "Streaming" tab of web UI. The attached file is the screen shot of the "Jobs" tab of web UI. The code in th

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
Thanks for the help. I set --executor-cores and it works now. I had used --total-executor-cores and didn't realize it had changed. Tathagata Das wrote on Fri, Jul 10, 2015 at 3:11 AM: > 1. There will be a long running job with description "start()" as that is > the job that runs the receivers. It will never e
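For reference, a sketch of the distinction (flag names are the standard spark-submit options; the numbers are placeholders): --total-executor-cores applies to standalone and Mesos coarse-grained mode, while on YARN cores are requested per executor, for example:
spark-submit --master yarn-cluster --num-executors 4 --executor-cores 2 ...
A streaming job needs more total cores than it has receivers; otherwise the receivers occupy every core and the batches themselves never get scheduled, which shows up as the job hanging right after start.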

Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
If I write code like this: val rdd = input.map(_.value) val f1 = rdd.filter(_ == 1) val f2 = rdd.filter(_ == 2) ... Then the DAG of the execution may be this: -> Filter -> ... Map -> Filter -> ... But the two filters operate on the same RDD, which means the work could be done by
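As far as plain RDDs go, Spark does not rewrite two separate filter() calls into a single pass; each downstream branch recomputes the parent map unless it is cached. A minimal sketch of the usual workarounds, reusing the names from the snippet above:
// Option 1: compute the shared parent once and reuse it for both predicates
val rdd = input.map(_.value).cache()
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
// Option 2: merge the predicates into a single filter if the results can be split later
val both = rdd.filter(v => v == 1 || v == 2)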

Re: Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
l 16, 2015 1:33 PM, "Bin Wang" wrote: > >> If I write code like this: >> >> val rdd = input.map(_.value) >> val f1 = rdd.filter(_ == 1) >> val f2 = rdd.filter(_ == 2) >> ... >> >> Then the DAG of the execution may be this: >> >>

Data lost in spark streaming

2015-09-10 Thread Bin Wang
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data it has received. But today the history data in the DStream seems to have been lost suddenly, and the application UI also lost the streaming processing times and all the related data. Could anyone give some hints on how to debug this? Thanks.

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
tered receiver for stream 0: Stopped by driver Tathagata Das wrote on Sun, Sep 13, 2015 at 4:05 PM: > Maybe the driver got restarted. See the log4j logs of the driver before it > restarted. > > On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang wrote: > >> I'm using spark streaming 1.4.0 and h

How to clear Kafka offset in Spark streaming?

2015-09-13 Thread Bin Wang
Hi, I'm using Spark Streaming with Kafka and I need to clear the offsets and re-compute everything. I deleted the checkpoint directory in HDFS and reset the Kafka offsets with "kafka-run-class kafka.tools.ImportZkOffsets". I can confirm the offset is set to 0 in Kafka: ~ > kafka-run-class kafka.tools.Consu
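If the goal is simply to start from the beginning of whatever Kafka still retains, one sketch (assuming Spark 1.3+ with the spark-streaming-kafka artifact and an existing StreamingContext ssc; broker and topic names are placeholders) is to let the consumer ask for the smallest available offset instead of forcing a literal 0:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "auto.offset.reset" -> "smallest")  // start from the earliest offset Kafka has retained
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my_topic"))
The checkpoint directory still has to be cleared, as described above, because offsets stored in the checkpoint take precedence over auto.offset.reset.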

Re: How to clear Kafka offset in Spark streaming?

2015-09-14 Thread Bin Wang
I think I've found the reason. It seems that the smallest offset is not 0 and I should not set the offset to 0. Bin Wang wrote on Mon, Sep 14, 2015 at 2:46 PM: > Hi, > > I'm using spark streaming with kafka and I need to clear the offset and > re-compute all things. I deleted checkp

How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
I'd like to know if there is a way to recover a DStream from a checkpoint. Because I store state in the DStream, I'd like the state to be recovered when I restart the application and deploy new code.
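The standard pattern for this (a sketch, not code from the thread; the checkpoint path and app name are placeholders) is StreamingContext.getOrCreate, which rebuilds the whole DStream graph, including its state, from the checkpoint directory when one exists:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val checkpointDir = "/path/to/checkpoint"
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(300))
  ssc.checkpoint(checkpointDir)
  // build the DStream graph here
  ssc
}
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
The catch, which the replies below get at, is that the checkpoint stores the serialized DStream graph, so deploying changed code generally invalidates it; state that must survive code upgrades has to live in external storage.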

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
keeper etc) to keep the state (the indexes etc) and then when you deploy > new code they can be easily recovered. > > Thanks > Best Regards > > On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang wrote: > >> I'd like to know if there is a way to recovery dstream from checkpoint

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
And here is another question. If I load the DStream from the database every time I start the job, will the data be loaded when the job fails and auto-restarts? If so, both the checkpoint data and the database data are loaded; won't this be a problem? Bin Wang wrote on Wed, Sep 16, 2015 at 8:40 PM: &

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
j...@mail.gmail.com%3E > > Thanks > Best Regards > > On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang wrote: > >> And here is another question. If I load the DStream from database every >> time I start the job, will the data be loaded when the job is failed and >> auto

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
m DB > 2. By cleaning the checkpoint in between upgrades, data is loaded > only once > > Hope this helps, > -adrian > > From: Bin Wang > Date: Thursday, September 17, 2015 at 11:27 AM > To: Akhil Das > Cc: user > Subject: Re: How to recovery DStream from

Is it possible to merge delayed batches in streaming?

2015-09-23 Thread Bin Wang
I'm using Spark Streaming and there may be some delays between batches. I'd like to know whether it is possible to merge delayed batches into one batch for processing. For example, the interval is set to 5 min but the first batch takes 1 hour, so there are many delayed batches. At the end of processing fo

Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
I'd like the Spark application to be stopped gracefully when it receives a kill signal, so I added this code: sys.ShutdownHookThread { println("Gracefully stopping Spark Streaming Application") ssc.stop(stopSparkContext = true, stopGracefully = true) println("Application stopped")

Re: Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
the SparkConf > "spark.streaming.stopGracefullyOnShutdown" to "true" > > Note to self, document this in the programming guide. > > On Wed, Sep 23, 2015 at 3:33 AM, Bin Wang wrote: > >> I'd like the spark application to be stoppe
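Following that reply, a minimal sketch of the configuration-based alternative to a hand-written shutdown hook (property name as given in the reply; the app name is a placeholder):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("my-streaming-app")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
With this set, Spark Streaming registers its own shutdown hook and stops the context gracefully when the driver receives a termination signal, so the explicit sys.ShutdownHookThread from the original message is no longer needed.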

PySpark read from HBase

2016-08-12 Thread Bin Wang
Hi there, I have lots of raw data in several Hive tables, and we built a workflow to "join" those records together and restructure them into HBase. It was done using plain MapReduce to generate HFiles and then incrementally load them into HBase to guarantee the best performance. However, we need

How to close a connection in mapPartitions?

2015-10-22 Thread Bin Wang
I use mapPartitions to open connections to Redis, and I write it like this: val seqs = lines.mapPartitions { lines => val cache = new RedisCache(redisUrl, redisPort) val result = lines.map(line => Parser.parseBody(line, cache)) cache.redisPool.close result } But it see
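The usual culprit (a sketch of a common fix, not a quote from the thread; identifiers reused from the snippet above) is that the map inside mapPartitions is lazy, so redisPool.close runs before any record is actually parsed. Forcing evaluation before closing avoids that:
val seqs = lines.mapPartitions { iter =>
  val cache = new RedisCache(redisUrl, redisPort)
  // materialize the partition while the connection pool is still open
  val result = iter.map(line => Parser.parseBody(line, cache)).toList
  cache.redisPool.close
  result.iterator
}
Materializing the whole partition is the simplest option; for very large partitions, a wrapper iterator that closes the pool once hasNext returns false is the more memory-friendly alternative.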

Re: How to close a connection in mapPartitions?

2015-10-22 Thread Bin Wang
BTW, "lines" is a DStream. Bin Wang wrote on Fri, Oct 23, 2015 at 2:16 PM: > I use mapPartitions to open connections to Redis, I write it like this: > > val seqs = lines.mapPartitions { lines => > val cache = new RedisCache(redisUrl, redisPort) > val result = lines.