Hi,
I know a task can fail twice and that only the third failure breaks the entire job.
I am fine with this number of attempts.
What I would like is for the job to continue with the other tasks after a task
has been tried 3 times.
The job can end up "failed", but I want all tasks to run.
Please see my use case.
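As a plain-Python sketch of the semantics being asked for (this is not what Spark does out of the box — Spark aborts the stage once one task exhausts its attempts; the function below is a hypothetical driver-side illustration, all names invented):

```python
# Hypothetical sketch: run every task, give each up to 3 attempts, and only
# mark the whole job "failed" at the end -- other tasks still run even after
# one task has exhausted its retries.

def run_all_tasks(tasks, max_attempts=3):
    """tasks: list of zero-arg callables. Returns (results, failures, job_failed)."""
    results, failures = {}, {}
    for i, task in enumerate(tasks):
        for attempt in range(1, max_attempts + 1):
            try:
                results[i] = task()
                break
            except Exception as exc:
                if attempt == max_attempts:
                    failures[i] = exc  # give up on this task, continue with the rest
    job_failed = bool(failures)
    return results, failures, job_failed
```

Here the job is reported failed only after every task has had its 3 attempts.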
I read a hadoop
Hi,
Set 10ms and spark.streaming.backpressure.enabled=true
This should automatically delay the next batch until the current one is
processed, or at least balance the consume/process rate against the ingestion
rate over a few batches.
Nicu
Hi,
Phoenix, an SQL coprocessor for HBase, has DataFrame ingestion integration in
its 4.x versions.
For HBase and RDDs in general there are multiple solutions: the hbase-spark
module by Cloudera, which will be part of a future HBase release; hbase-rdd by
Unicredit; and many others.
I am not sure if
Hi,
If you just need per-topic processing, why not create N different Kafka
direct streams? When creating a Kafka direct stream you pass a list of topics -
just give it one.
Then the reusable part of your computations should be extractable as
transformations/functions and reused between the
ons to give clarity on a similar thing.
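The "extract the reusable part as a function" idea can be sketched without Spark, with plain lists standing in for the per-topic streams (all names below are invented for illustration):

```python
# Spark-free sketch: the shared computation is one function, applied to each
# per-topic "stream" (plain lists stand in for Kafka direct streams here).

def enrich_and_count(records):
    """The reusable transformation: normalize each record, then count by key."""
    counts = {}
    for rec in records:
        key = rec.strip().lower()
        counts[key] = counts.get(key, 0) + 1
    return counts

topics = {
    "clicks": ["A", "a", "B"],
    "views":  ["B", "b", "b"],
}

# One "stream" per topic, all reusing the same transformation:
per_topic_counts = {topic: enrich_and_count(batch) for topic, batch in topics.items()}
```

In Spark the same shape would apply: one direct stream per topic, each passed through the same transformation function.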
Nicu
From: Kartik Mathur <kar...@bluedata.com>
Sent: Thursday, October 1, 2015 10:25 PM
To: Nicolae Marasoiu
Cc: user
Subject: Re: Problem understanding spark word count execution
Thanks Nicolae,
So in my case all executors
part.
From: Andy Dang <nam...@gmail.com>
Sent: Wednesday, September 30, 2015 8:17 PM
To: Nicolae Marasoiu
Cc: user@spark.apache.org
Subject: Re: sc.parallelize with defaultParallelism=1
Can't you just load the data from HBase first, and then call sc.parallelize on
your datas
basically need is invalidating some portions of the data which
have newer values. The "compute" method should be the same (read with
TableInputFormat).
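The invalidation semantics described — portions with newer values replace the previously loaded ones, everything else is kept — can be sketched as (a minimal, Spark-free illustration; the function name is invented):

```python
# Sketch of the invalidation idea: entries whose keys show up in the fresh
# batch replace the previously loaded ones; all other entries are kept as-is.

def invalidate_and_merge(cached, fresh):
    """cached, fresh: dicts of row_key -> value. fresh wins on conflicts."""
    merged = dict(cached)
    merged.update(fresh)   # newer values overwrite the stale portions
    return merged
```

The read path itself (compute via TableInputFormat) stays unchanged; only the merge of old and new portions differs.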
Thanks
Nicu
From: Nicolae Marasoiu <nicolae.maras...@adswizz.com>
Sent: Wednesday, September
Hi,
When calling sc.parallelize(data, 1), is there a preference for where to put
the data? I see two possibilities: sending it to a worker node, or keeping it
on the driver program.
I would prefer to keep the data local to the driver. The use case is when I
just need to load a bit of data from
the latest data,
and just compute on top of that if anything.
Basically I guess I need to write my own RDD and implement the compute method
by sliding over HBase.
Thanks,
Nicu
From: Nicolae Marasoiu <nicolae.maras...@adswizz.com>
Sent: Wednesday, September 30, 2015 3
Hi,
If I implement a way to keep an up-to-date version of my RDD by ingesting
some new events, called RDD_inc (from "increment"), and I provide a "merge"
function m(RDD, RDD_inc) which returns RDD_new, it looks like I can evolve the
state of my RDD by constructing new RDDs all the time,
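The evolve-by-merging idea can be sketched without Spark: each increment is folded into the current state through m, producing a new immutable snapshot rather than mutating the old one (all names below are hypothetical, with tuples standing in for RDDs):

```python
from functools import reduce

# Sketch: state evolves by folding increments through m; each call returns a
# new immutable snapshot, mirroring how RDD_new would be derived from the
# previous RDD plus RDD_inc without mutating either.

def m(rdd, rdd_inc):
    """Merge an increment into the current state; newer values win per key."""
    merged = dict(rdd)
    merged.update(rdd_inc)
    return tuple(sorted(merged.items()))  # new immutable snapshot

rdd0 = (("a", 1),)
increments = [{"b": 2}, {"a": 10}]
rdd_new = reduce(m, increments, rdd0)
```

Note that rdd0 is untouched afterwards — only new values are constructed, which is the property that makes this style fit RDD semantics.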
Hi,
2 - the end results are sent back to the driver; the shuffles are
transmissions of intermediate results between nodes, such as the -> which are
all intermediate transformations.
More precisely, since flatMap and map are narrow dependencies, meaning they
can usually happen on the local node,
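The word-count dataflow being discussed can be sketched Spark-free: flatMap and map run per partition with no data movement (narrow dependencies), and only the grouping by word would require a shuffle between nodes (the list-of-lists below stands in for partitions; this is an illustration, not Spark's actual execution):

```python
from itertools import chain

# Spark-free sketch of word count: flatMap/map are applied per partition with
# no data movement (narrow dependencies); bringing pairs with the same word
# together is the only step that would require a shuffle between nodes.

partitions = [["to be or"], ["not to be"]]

# Narrow stage: flatMap (split lines into words) then map (word -> (word, 1)),
# done independently inside each partition.
mapped = [[(w, 1) for line in part for w in line.split()] for part in partitions]

# Shuffle-stage stand-in: group pairs by key across partitions, then reduce.
counts = {}
for word, one in chain.from_iterable(mapped):
    counts[word] = counts.get(word, 0) + one
```

Only the final counts (the reduced values) would then travel back to the driver.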