Hi there,
I have lots of raw data in several Hive tables, where we built a workflow to
join those records together and restructure them into HBase. It was done
using plain MapReduce to generate HFiles, which are then incrementally loaded
into HBase to guarantee the best performance.
However, we
I use mapPartitions to open connections to Redis, and I write it like this:

val seqs = lines.mapPartitions { lines =>
  val cache = new RedisCache(redisUrl, redisPort)
  val result = lines.map(line => Parser.parseBody(line, cache))
  cache.redisPool.close
  result
}
But it
BTW, "lines" is a DStream.
Bin Wang <wbi...@gmail.com> wrote on Friday, October 23, 2015 at 2:16 PM:
> I use mapPartitions to open connections to Redis, I write it like this:
>
> val seqs = lines.mapPartitions { lines =>
> val cache = new RedisCache(redisUrl, redisPort)
>
I'd like the Spark application to stop gracefully when it receives a kill
signal, so I added this code:
sys.ShutdownHookThread {
  println("Gracefully stopping Spark Streaming Application")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  println("Application stopped gracefully")
}
I'm using Spark Streaming and there may be some delays between batches. I'd
like to know whether it is possible to merge delayed batches into one batch
for processing?
For example, the interval is set to 5 min but the first batch takes 1 hour,
so many batches are delayed. At the end of processing
07.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E
>
> Thanks
> Best Regards
>
> On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbi...@gmail.com> wrote:
>
>> And here is another question. If I load the DStream from database every
>> time I start the job, will the dat
the values preloaded from DB
> 2. By cleaning the checkpoint in between upgrades, data is loaded
> only once
>
> Hope this helps,
> -adrian
>
> From: Bin Wang
> Date: Thursday, September 17, 2015 at 11:27 AM
> To: Akhil Das
> Cc: user
> Subject: Re: How t
…storage (like a db or
> zookeeper etc) to keep the state (the indexes etc) and then when you deploy
> new code they can be easily recovered.
>
> Thanks
> Best Regards
>
> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbi...@gmail.com> wrote:
>
>> I'd like to know if th
I'd like to know if there is a way to recover a DStream from a checkpoint.
Because I store state in the DStream, I'd like that state to be recovered when
I restart the application and deploy new code.
And here is another question. If I load the DStream from the database every
time I start the job, will the data also be loaded when the job fails and
auto-restarts? If so, both the checkpoint data and the database data are
loaded; won't that be a problem?
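For reference, the standard checkpoint-recovery pattern is StreamingContext.getOrCreate: it rebuilds the context (and the state of its DStreams) from the checkpoint directory if one exists, and otherwise calls the creation function. A minimal sketch, with the checkpoint path, app name, and batch interval as placeholder assumptions; note that, as discussed in this thread, a checkpoint generally cannot be reused across upgraded application code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(300))
  ssc.checkpoint("hdfs:///path/to/checkpoint")  // placeholder path
  // define the DStreams and stateful operations here
  ssc
}

// Recover from the checkpoint if present, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate("hdfs:///path/to/checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()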
Bin Wang <wbi...@gmail.com> wrote on Wednesday, September 16, 2015:
Hi,
I'm using Spark Streaming with Kafka and I need to clear the offsets and
recompute everything. I deleted the checkpoint directory in HDFS and reset the
Kafka offsets with "kafka-run-class kafka.tools.ImportZkOffsets". I can
confirm the offsets are set to 0 in Kafka:
~ > kafka-run-class
…receiver for stream 0: Stopped by driver
Tathagata Das <t...@databricks.com> wrote on Sunday, September 13, 2015 at 4:05 PM:
> Maybe the driver got restarted. See the log4j logs of the driver before it
> restarted.
>
> On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang <wbi...@gmail.com> wrote:
>
&
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data
it has received. But today the historical data in the DStream seems to have
been lost suddenly. And the application UI also lost the streaming processing
times and all the related data. Could anyone give some hints on debugging
this? Thanks.
16, 2015 1:33 PM, Bin Wang wbi...@gmail.com wrote:
If I write code like this:
val rdd = input.map(_.value)
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
...
Then the DAG of the execution may be this:

      /- Filter - ...
Map --
      \- Filter - ...

But the two filters operate on the same RDD, which means it could be
done by
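One related aside (the question above is cut off in the archive): unless the mapped RDD is cached, each filter branch that ends in its own action will recompute the map step for itself. A minimal sketch, reusing the names from the snippet above, of one way to share the mapped data between the two branches:

val rdd = input.map(_.value).cache()  // keep the mapped data in memory so both
                                      // filter branches reuse it instead of
                                      // recomputing the map for each action
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)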
I'm using Spark Streaming with Kafka, and I submit it to a YARN cluster in
yarn-cluster mode. But it hangs at SparkContext.start(). The Kafka config
is right, since it can show some events in the Streaming tab of the web UI.
The attached file is a screenshot of the Jobs tab of the web UI. The code
in the
Thanks for the help. I set --executor-cores and it works now. I had been
using --total-executor-cores and didn't realize it had changed.
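For anyone hitting the same thing: on YARN the per-executor core count is set with --executor-cores (together with --num-executors for the number of executors), while --total-executor-cores applies to standalone/Mesos deployments. A hedged example invocation; the class name, jar, and resource numbers are placeholders:

# Placeholder class/jar names; resource numbers are only illustrative.
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar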
Tathagata Das t...@databricks.com wrote on Friday, July 10, 2015 at 3:11 AM:
1. There will be a long-running job with the description start(), as that is
the job that runs the receivers.
of the streaming app.
On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wbi...@gmail.com wrote:
I'm writing a streaming application and want to use spark-submit to submit
it to a YARN cluster. I'd like to submit it from a client node and have
spark-submit exit after the application is running. Is it possible?
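One approach, hedged and assuming Spark 1.4+ on YARN: submit in yarn-cluster mode and set spark.yarn.submit.waitAppCompletion=false, so spark-submit returns once the application is accepted instead of polling until it finishes. A sketch with placeholder names:

# Placeholder class/jar names.
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar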
I am trying to run the Spark example code HBaseTest from the command line
using spark-submit instead of run-example, so that I can learn more about how
to run Spark code in general.
However, it told me CLASS_NOT_FOUND for htrace since I am using CDH 5.4. I
successfully located the htrace jar file but I
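The message above is cut off, but for the record, one common way to make an extra dependency like the htrace jar visible to the driver and executors is to pass it with --jars. A hedged sketch; the jar paths and table name are placeholders, not the actual CDH 5.4 layout:

# Placeholder paths; point --jars at the htrace jar found on the cluster.
spark-submit \
  --class org.apache.spark.examples.HBaseTest \
  --jars /path/to/htrace-core.jar \
  /path/to/spark-examples.jar my_table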
development. Please share.
Bin
On Tue, May 12, 2015 at 1:20 AM, Tomas Olsson tomas.ols...@mdh.se wrote:
Hi,
You can try
PYSPARK_DRIVER_PYTHON=/path/to/ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /path/to/pyspark
/Tomas
On 11 May 2015, at 22:17, Bin Wang binwang...@gmail.com wrote:
Hey there,
I have installed a Python interpreter in a certain location, say
/opt/local/anaconda.
Is there any way to specify that Python interpreter while developing in an
IPython notebook? Maybe a property to set while creating the SparkContext?
I know that I can put #!/opt/local/anaconda
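The message above is truncated, but for completeness: the usual way to point PySpark at a specific interpreter is through environment variables rather than a SparkContext property. A hedged sketch, assuming the Anaconda path mentioned above:

# Worker-side interpreter; must exist at this path on every node.
export PYSPARK_PYTHON=/opt/local/anaconda/bin/python
# Driver-side interpreter, e.g. the one running the IPython notebook.
export PYSPARK_DRIVER_PYTHON=/opt/local/anaconda/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"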
I started an AWS cluster (1 master + 3 core nodes) and downloaded the prebuilt
Spark binary. I downloaded the latest Anaconda Python and started an IPython
notebook server by running the command below:
ipython notebook --port --profile nbserver --no-browser
Then, I try to develop a Spark
Hi there,
I have a cluster with CDH 5.1 running on top of RedHat 6.5, where the default
Python version is 2.6. I am trying to set up a proper IPython notebook
environment to develop Spark applications using PySpark.
Here
Hi Flavio,
I happened to attend (actually, I am attending) the 2014 Apache Conference,
where I heard about a project called Apache Phoenix, which fully leverages
HBase and is supposed to be 1000x faster than Hive. And it is not
memory-bound, which is something that sets a limit for Spark. It is still in
the incubator and
...
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang binwang...@gmail.com wrote:
Hi Ognen/Mayur,
Thanks for the reply, and it is good to know how easy it is to set up Spark
on AWS
Hi there,
I tried the Kafka WordCount example and it works perfectly, and the code is
pretty straightforward to understand.
Can anyone show me how to start my own Maven project with the KafkaWordCount
example with minimum effort?
1. What should the pom file look like (including the jar plugin?