Hi,
I know a task can fail 2 times, and only the 3rd failure breaks the entire job.
I am fine with this number of attempts.
What I would like is that, after trying a task 3 times, Spark continues with the
other tasks.
The job can end up marked "failed", but I want all tasks to run.
Please see my use case.
I read a hadoop in
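A minimal sketch of the two pieces involved, assuming spark.task.maxFailures is the setting that caps the attempts; the paths and the riskyProcess function below are hypothetical. Catching failures per record inside the task is one way to let the rest of the work finish even when some records fail:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object KeepGoingSketch {
  def main(args: Array[String]): Unit = {
    // spark.task.maxFailures caps how many attempts a task gets before the
    // whole stage (and with it the job) is aborted.
    val conf = new SparkConf()
      .setAppName("keep-going-sketch")
      .set("spark.task.maxFailures", "3")
    val sc = new SparkContext(conf)

    // Hypothetical per-record processing that can throw.
    def riskyProcess(line: String): String = line.toUpperCase

    // Swallowing the failure inside the task keeps the task (and the other
    // tasks) running; failed records are simply dropped here.
    val results = sc.textFile("hdfs:///input")      // hypothetical path
      .map(line => Try(riskyProcess(line)))
      .filter(_.isSuccess)
      .map(_.get)

    results.saveAsTextFile("hdfs:///output")        // hypothetical path
    sc.stop()
  }
}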
Hi,
Phoenix, an SQL coprocessor for HBase, has ingestion integration with
DataFrames in its 4.x versions.
For HBase and RDDs in general there are multiple solutions: the hbase-spark
module by Cloudera, which will be part of a future HBase release, hbase-rdd by
unicredit, and many others.
I am not sure if t
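A minimal sketch of the Phoenix/DataFrame integration mentioned above, using the phoenix-spark plugin; the table name and ZooKeeper URL are placeholders:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PhoenixReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("phoenix-read-sketch"))
    val sqlContext = new SQLContext(sc)

    // Load a Phoenix table as a DataFrame ("WEB_STAT" and zkUrl are placeholders).
    val df = sqlContext.read
      .format("org.apache.phoenix.spark")
      .option("table", "WEB_STAT")
      .option("zkUrl", "zk-host:2181")
      .load()

    df.show()
    sc.stop()
  }
}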
Hi,
Set 10 ms and spark.streaming.backpressure.enabled=true.
This should automatically delay the next batch until the current one is
processed, or at least, over a few batches/periods, balance the
consume/process rate against the ingestion rate.
Nicu
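A sketch of the configuration being suggested; the original question is not visible in this excerpt, so treating the 10 ms as the batch interval (and using a socket source) are assumptions made purely for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object BackpressureSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("backpressure-sketch")
      .set("spark.streaming.backpressure.enabled", "true") // available since Spark 1.5

    val ssc = new StreamingContext(conf, Milliseconds(10)) // assumption: 10 ms batch interval

    // Placeholder source and output operation, just so the context has work to schedule.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}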
Hi,
If you just need per-topic processing, why not create N different Kafka direct
streams? When creating a Kafka direct stream you pass a list of topics - just
give it one.
Then the reusable part of your computations should be extractable as
transformations/functions and reused between the st
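A sketch of that per-topic approach, against the Spark 1.x Kafka direct API; topic names, the broker list and the enrich() function are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerTopicStreamsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("per-topic-sketch"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Seq("topicA", "topicB", "topicC")

    // The reusable part of the computation, extracted as a plain function.
    def enrich(values: DStream[String]): DStream[String] =
      values.filter(_.nonEmpty).map(_.trim)

    // One direct stream per topic, each reusing the same transformation.
    val perTopic: Seq[DStream[String]] = topics.map { topic =>
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic))
      enrich(stream.map(_._2))
    }

    perTopic.foreach(_.print())

    ssc.start()
    ssc.awaitTermination()
  }
}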
o give clarity on a similar thing.
Nicu
From: Kartik Mathur
Sent: Thursday, October 1, 2015 10:25 PM
To: Nicolae Marasoiu
Cc: user
Subject: Re: Problem understanding spark word count execution
Thanks Nicolae,
So in my case all executors are sending results back to
Hi,
2 - the end results are sent back to the driver; the shuffles are transmissions
of intermediate results between nodes, i.e. between the intermediate
transformations.
More precisely, since flatMap and map are narrow dependencies, meaning they can
usually happen on the local node,
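The word count being discussed, sketched with comments on where the narrow dependencies end and the shuffle begins; the input path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count-sketch"))

    val counts = sc.textFile("hdfs:///input/text") // placeholder path
      .flatMap(_.split("\\s+"))  // narrow dependency: stays on the local node
      .map(word => (word, 1))    // narrow dependency: stays on the local node
      .reduceByKey(_ + _)        // wide dependency: intermediate results are shuffled between nodes

    // Only the final, reduced counts are sent back to the driver here.
    counts.collect().foreach(println)
    sc.stop()
  }
}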
basically need is invalidating some portions of the data which
have newer values. The "compute" method should be the same (read with
TableInputFormat).
Thanks
Nicu
____
From: Nicolae Marasoiu
Sent: Wednesday, September 30, 2015 4:07 PM
To: user@spark.apache.o
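A sketch of the TableInputFormat-based read referred to above; the table name is a placeholder:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-scan-sketch"))

    // Point TableInputFormat at the table to scan ("my_table" is a placeholder).
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // By default TableInputFormat yields one split (one RDD partition) per region.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"row count: ${rows.count()}")
    sc.stop()
  }
}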
m/r part.
From: Andy Dang
Sent: Wednesday, September 30, 2015 8:17 PM
To: Nicolae Marasoiu
Cc: user@spark.apache.org
Subject: Re: sc.parallelize with defaultParallelism=1
Can't you just load the data from HBase first, and then call sc.parallelize on
your dataset?
-Andy
---
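A sketch of that suggestion: scan a small table with the plain HBase client on the driver, then parallelize the result. Table, column family and qualifier names are placeholders, and this only makes sense for a small amount of data:

import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

object DriverSideLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-side-load-sketch"))

    // Plain HBase client scan, executed in the driver JVM.
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("small_table"))
    val values: Seq[String] = table.getScanner(new Scan()).asScala
      .map(result => Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))))
      .toSeq
    table.close()
    connection.close()

    // The fetched rows live in the driver; parallelize ships them to an executor
    // only when an action runs.
    val rdd = sc.parallelize(values, 1)
    println(s"loaded ${rdd.count()} rows")
    sc.stop()
  }
}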
latest data,
and just compute on top of that if anything.
Basically I guess I need to write my own RDD and implement the compute method
by sliding on HBase.
Thanks,
Nicu
From: Nicolae Marasoiu
Sent: Wednesday, September 30, 2015 3:05 PM
To: user@spark.apache.org
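A skeleton of that "write my own RDD" idea: the two members every custom RDD has to provide. How partitions map to HBase regions or time slices, and how compute() actually scans HBase, are left out of this sketch:

import org.apache.hadoop.hbase.client.Result
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}

// A partition that just remembers its index; a real one would carry the
// key/time range it covers.
class SlicePartition(val index: Int) extends Partition

class SlidingHBaseRDD(sc: SparkContext, numSlices: Int) extends RDD[Result](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => new SlicePartition(i): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Result] = {
    // Hypothetical: open a scanner for the range that corresponds to this
    // partition and return its rows; the actual HBase read would go here.
    Iterator.empty
  }
}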
Hi,
If I implement a way to keep an up-to-date version of my RDD by ingesting some
new events, called RDD_inc (from increment), and I provide a "merge" function
m(RDD, RDD_inc) which returns RDD_new, it looks like I can evolve the state of
my RDD by constructing new RDDs all the time,
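One possible shape for that m(RDD, RDD_inc) => RDD_new merge, assuming (purely for this sketch) that records are (key, (version, value)) pairs and that the newer version should win:

import org.apache.spark.rdd.RDD

object IncrementalMergeSketch {
  type Record = (String, (Long, String)) // (key, (version, value)) - assumed layout

  def m(current: RDD[Record], increment: RDD[Record]): RDD[Record] =
    current.union(increment).reduceByKey { (a, b) =>
      if (a._1 >= b._1) a else b // keep the record with the newer version
    }
}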
Hi,
When calling sc.parallelize(data, 1), is there a preference for where to put
the data? I see 2 possibilities: sending it to a worker node, or keeping it in
the driver program.
I would prefer to keep the data local to the driver. The use case is when I
just need to load a bit of data from HBas
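A tiny illustration of the placement question: parallelize() keeps the collection in the driver program until an action runs; with numSlices = 1 the single partition is then shipped to whichever executor runs the one task:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelizePlacementSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallelize-placement"))

    val data = Seq("row1", "row2", "row3")  // lives in the driver JVM for now
    val rdd = sc.parallelize(data, 1)       // still no cluster activity

    println(rdd.map(_.length).sum())        // action: the single task runs on an executor
    sc.stop()
  }
}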