Anaconda iPython notebook working with CDH Spark

2014-12-28 Thread Bin Wang
Hi there, I have a cluster with CDH5.1 running on top of RedHat 6.5, where the default Python version is 2.6. I am trying to set up a proper iPython notebook environment to develop Spark applications using PySpark. Here

Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
Hi there, I have a CDH cluster set up, and I tried using the Spark parcel that comes with Cloudera Manager, but it turns out it doesn't even have the run-example shell script in the bin folder. Then I removed it from the cluster and cloned incubator-spark onto the name node of my cluster, and buil

Re: Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
oups get really old quickly. Added bonus > is that you can use a VPN as an entry into the whole system and your > cluster instantly becomes "local" to you in terms of IPs etc. I use OpenVPN > since I don't like Cisco nor Juniper (the only two options Amazon provides > for t

Re: Missing Spark URL after starting the master

2014-03-04 Thread Bin Wang
t; > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang wrote: > >> Hi Ognen/Mayur, >> >> Thanks for the reply and it

Spark Streaming Maven Build

2014-03-04 Thread Bin Wang
Hi there, I tried the Kafka WordCount example; it works perfectly and the code is pretty straightforward to understand. Can anyone show me how to start my own Maven project based on the KafkaWordCount example with minimum effort? 1. What the pom file should look like (including jar-plugin? ass

Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio, I happen to be attending the 2014 Apache Conf, where I heard about a project called "Apache Phoenix", which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory-bound, something that does set a limit for Spark. It is still in the incubating group and t

Re: Spark and HBase

2014-04-08 Thread Bin Wang
od Phoenix is good if I > have to query a simple column of HBase but things get really complicated if > I have to add an index for each column in my table and I store complex > object within the cells. Is it correct? > > Best, > Flavio > > > > > On Tue, Apr 8, 201

Re: Scala vs Python performance differences

2014-04-14 Thread Bin Wang
At least, Spark Streaming doesn't support Python at this moment, right? On Mon, Apr 14, 2014 at 6:48 PM, Andrew Ash wrote: > Hi Spark users, > > I've always done all my Spark work in Scala, but occasionally people ask > about Python and its performance impact vs the same algorithm > implementat

Using Pandas/Scikit Learning in Pyspark

2015-05-08 Thread Bin Wang
Hey there, I have a CDH cluster where the default Python installed on those RedHat Linux machines is Python 2.6. I am thinking about developing a Spark application using PySpark and I want to be able to use the Pandas and scikit-learn packages. The Anaconda Python interpreter has the most functionality out of the box
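A common way to do this (a sketch, assuming Anaconda is installed at the same path on every node, for example /opt/local/anaconda) is to point PySpark at the Anaconda interpreter through environment variables before launching:
export PYSPARK_PYTHON=/opt/local/anaconda/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/local/anaconda/bin/python
With both variables set, the driver and the executors use the Anaconda interpreter, and its bundled Pandas and scikit-learn packages, instead of the system Python 2.6.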

Spark on top of YARN Compression in iPython notebook

2015-05-10 Thread Bin Wang
I started an AWS cluster (1 master + 3 core) and downloaded the prebuilt Spark binary. I downloaded the latest Anaconda Python and started an iPython notebook server by running the command below: ipython notebook --port --profile nbserver --no-browser Then, I try to develop a Spark application

Specify Python interpreter

2015-05-11 Thread Bin Wang
Hey there, I have installed a Python interpreter in a certain location, say "/opt/local/anaconda". Is there any way to specify the Python interpreter while developing in an iPython notebook? Maybe a property to set while creating the SparkContext? I know that I can put "#!/opt/local/anacond

Re: Specify Python interpreter

2015-05-12 Thread Bin Wang
ook to interactively write Spark applications. If you have a better Python solution that can offer a better workflow for interactive Spark development, please share. Bin On Tue, May 12, 2015 at 1:20 AM, Tomas Olsson wrote: > Hi, > You can try > > PYSPARK_DRIVER_PYTHON=/path/to/ipython

Problem Running Spark Example HBase Code Using Spark-Submit

2015-06-25 Thread Bin Wang
I am trying to run the Spark example code HBaseTest from the command line using spark-submit instead of run-example, so that I can learn more about how to run Spark code in general. However, it reports CLASS_NOT_FOUND for htrace since I am using CDH5.4. I successfully located the htrace jar file but I
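One fix that often works here (a sketch; HBaseTest is the example's real entry point, but the jar locations below are assumptions for a CDH 5.4 parcel layout) is to ship the htrace jar explicitly with --jars:
spark-submit --class org.apache.spark.examples.HBaseTest \
  --master yarn-client \
  --jars /opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar \
  /path/to/spark-examples.jar my_table
Adding the same jar to spark.driver.extraClassPath and spark.executor.extraClassPath is an alternative when --jars is not convenient.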

How to submit a streaming application and exit

2015-07-07 Thread Bin Wang
I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it in a client node and exit spark-submit after the application is running. Is it possible?
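One option (a sketch, assuming Spark 1.4+ on YARN; the class and jar names are placeholders) is to submit in yarn-cluster mode and tell spark-submit not to wait for the application to finish:
spark-submit --master yarn-cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --class com.example.MyStreamingApp /path/to/streaming-app.jar
With that setting the driver runs inside the YARN application master, and the spark-submit process on the client node returns once the application has been accepted by YARN.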

Re: How to submit a streaming application and exit

2015-07-08 Thread Bin Wang
for > the lifetime of the streaming app. > > On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wrote: > >> I'm writing a streaming application and want to use spark-submit to >> submit it to a YARN cluster. I'd like to submit it in a client node and >> exit spar

Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
I'm using spark streaming with Kafka, and submit it to YARN cluster with mode "yarn-cluster". But it hangs at SparkContext.start(). The Kafka config is right since it can show some events in "Streaming" tab of web UI. The attached file is the screen shot of the "Jobs" tab of web UI. The code in th

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
Thanks for the help. I set --executor-cores and it works now. I had used --total-executor-cores and didn't realize it had changed. Tathagata Das wrote on Fri, Jul 10, 2015 at 3:11 AM: > 1. There will be a long running job with description "start()" as that is > the job that runs the receivers. It will never e
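For reference, a sketch of the distinction (flag names are the standard spark-submit options; the numbers are placeholders): --total-executor-cores applies to standalone and Mesos coarse-grained mode, while on YARN cores are requested per executor, for example:
spark-submit --master yarn-cluster --num-executors 4 --executor-cores 2 ...
A streaming job needs more total cores than it has receivers; otherwise the receivers occupy every core and the batches themselves never get scheduled, which shows up as the job hanging right after start.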

Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
If I write code like this: val rdd = input.map(_.value) val f1 = rdd.filter(_ == 1) val f2 = rdd.filter(_ == 2) ... Then the DAG of the execution may be this: -> Filter -> ... Map -> Filter -> ... But the two filters operate on the same RDD, which means the work could be done by
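As far as plain RDDs go, Spark does not rewrite two separate filter() calls into a single pass; each downstream branch recomputes the parent map unless it is cached. A minimal sketch of the usual workarounds, reusing the names from the snippet above:
// Option 1: compute the shared parent once and reuse it for both predicates
val rdd = input.map(_.value).cache()
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
// Option 2: merge the predicates into a single filter if the results can be split later
val both = rdd.filter(v => v == 1 || v == 2)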

Re: Will multiple filters on the same RDD be optimized into one filter?

2015-07-16 Thread Bin Wang
l 16, 2015 1:33 PM, "Bin Wang" wrote: > >> If I write code like this: >> >> val rdd = input.map(_.value) >> val f1 = rdd.filter(_ == 1) >> val f2 = rdd.filter(_ == 2) >> ... >> >> Then the DAG of the execution may be this: >> >>

Data lost in spark streaming

2015-09-10 Thread Bin Wang
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data it has received. But today the history data in the DStream seems to have been lost suddenly, and the application UI also lost the streaming processing times and all the related data. Could anyone give some hints on how to debug this? Thanks.

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
tered receiver for stream 0: Stopped by driver Tathagata Das wrote on Sun, Sep 13, 2015 at 4:05 PM: > Maybe the driver got restarted. See the log4j logs of the driver before it > restarted. > > On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang wrote: > >> I'm using spark streaming 1.4.0 and h

How to clear Kafka offset in Spark streaming?

2015-09-13 Thread Bin Wang
Hi, I'm using Spark Streaming with Kafka and I need to clear the offsets and re-compute everything. I deleted the checkpoint directory in HDFS and reset the Kafka offsets with "kafka-run-class kafka.tools.ImportZkOffsets". I can confirm the offset is set to 0 in Kafka: ~ > kafka-run-class kafka.tools.Consu
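If the goal is simply to start from the beginning of whatever Kafka still retains, one sketch (assuming Spark 1.3+ with the spark-streaming-kafka artifact and an existing StreamingContext ssc; broker and topic names are placeholders) is to let the consumer ask for the smallest available offset instead of forcing a literal 0:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "auto.offset.reset" -> "smallest")  // start from the earliest offset Kafka has retained
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my_topic"))
The checkpoint directory still has to be cleared, as described above, because offsets stored in the checkpoint take precedence over auto.offset.reset.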

Re: How to clear Kafka offset in Spark streaming?

2015-09-14 Thread Bin Wang
I think I've found the reason. It seems that the smallest offset is not 0 and I should not set the offset to 0. Bin Wang wrote on Mon, Sep 14, 2015 at 2:46 PM: > Hi, > > I'm using spark streaming with kafka and I need to clear the offset and > re-compute all things. I deleted checkp

How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
I'd like to know if there is a way to recover a DStream from a checkpoint. Because I store state in the DStream, I'd like the state to be recovered when I restart the application and deploy new code.
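The standard pattern for this (a sketch, not code from the thread; the checkpoint path and app name are placeholders) is StreamingContext.getOrCreate, which rebuilds the whole DStream graph, including its state, from the checkpoint directory when one exists:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val checkpointDir = "/path/to/checkpoint"
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(300))
  ssc.checkpoint(checkpointDir)
  // build the DStream graph here
  ssc
}
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
The catch, which the replies below get at, is that the checkpoint stores the serialized DStream graph, so deploying changed code generally invalidates it; state that must survive code upgrades has to live in external storage.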

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
keeper etc) to keep the state (the indexes etc) and then when you deploy > new code they can be easily recovered. > > Thanks > Best Regards > > On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang wrote: > >> I'd like to know if there is a way to recovery dstream from checkpoint

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
And here is another question. If I load the DStream from the database every time I start the job, will the data be loaded when the job fails and auto-restarts? If so, both the checkpoint data and the database data are loaded; won't this be a problem? Bin Wang wrote on Wed, Sep 16, 2015 at 8:40 PM: &

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
j...@mail.gmail.com%3E > > Thanks > Best Regards > > On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang wrote: > >> And here is another question. If I load the DStream from database every >> time I start the job, will the data be loaded when the job is failed and >> auto

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
m DB > 2. By cleaning the checkpoint in between upgrades, data is loaded > only once > > Hope this helps, > -adrian > > From: Bin Wang > Date: Thursday, September 17, 2015 at 11:27 AM > To: Akhil Das > Cc: user > Subject: Re: How to recovery DStream from

Is it possible to merge delayed batches in streaming?

2015-09-23 Thread Bin Wang
I'm using Spark Streaming and there may be some delays between batches. I'd like to know whether it is possible to merge delayed batches into one batch for processing. For example, the interval is set to 5 min but the first batch takes 1 hour, so there are many delayed batches. At the end of processing fo

Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
I'd like the Spark application to be stopped gracefully when it receives a kill signal, so I added this code: sys.ShutdownHookThread { println("Gracefully stopping Spark Streaming Application") ssc.stop(stopSparkContext = true, stopGracefully = true) println("Application stopped")

Re: Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
the SparkConf > "spark.streaming.stopGracefullyOnShutdown" to "true" > > Note to self, document this in the programming guide. > > On Wed, Sep 23, 2015 at 3:33 AM, Bin Wang wrote: > >> I'd like the spark application to be stoppe
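Following that reply, a minimal sketch of the configuration-based alternative to a hand-written shutdown hook (property name as given in the reply; the app name is a placeholder):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("my-streaming-app")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
With this set, Spark Streaming registers its own shutdown hook and stops the context gracefully when the driver receives a termination signal, so the explicit sys.ShutdownHookThread from the original message is no longer needed.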

PySpark read from HBase

2016-08-12 Thread Bin Wang
Hi there, I have lots of raw data in several Hive tables, and we built a workflow to "join" those records together and restructure them into HBase. It was done using plain MapReduce to generate HFiles and then incrementally load them into HBase to guarantee the best performance. However, we need

How to close a connection in mapPartitions?

2015-10-22 Thread Bin Wang
I use mapPartitions to open connections to Redis, and I write it like this: val seqs = lines.mapPartitions { lines => val cache = new RedisCache(redisUrl, redisPort) val result = lines.map(line => Parser.parseBody(line, cache)) cache.redisPool.close result } But it see
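The usual culprit (a sketch of a common fix, not a quote from the thread; identifiers reused from the snippet above) is that the map inside mapPartitions is lazy, so redisPool.close runs before any record is actually parsed. Forcing evaluation before closing avoids that:
val seqs = lines.mapPartitions { iter =>
  val cache = new RedisCache(redisUrl, redisPort)
  // materialize the partition while the connection pool is still open
  val result = iter.map(line => Parser.parseBody(line, cache)).toList
  cache.redisPool.close
  result.iterator
}
Materializing the whole partition is the simplest option; for very large partitions, a wrapper iterator that closes the pool once hasNext returns false is the more memory-friendly alternative.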

Re: How to close a connection in mapPartitions?

2015-10-22 Thread Bin Wang
BTW, "lines" is a DStream. Bin Wang wrote on Fri, Oct 23, 2015 at 2:16 PM: > I use mapPartitions to open connections to Redis, I write it like this: > > val seqs = lines.mapPartitions { lines => > val cache = new RedisCache(redisUrl, redisPort) > val result = lines.