A way to allow a Spark job to continue despite task failures?

2015-11-13 Thread Nicolae Marasoiu
Hi, I know a task can fail 2 times and only the 3rd failure breaks the entire job. I am fine with this number of attempts. What I would like is for the job to continue with the other tasks after a task has been tried 3 times. The job can be marked "failed", but I want all tasks to run. Please see my use case. I read a hadoop
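A minimal sketch (Scala, assuming the Spark 1.x API) of the knob this thread is about, spark.task.maxFailures, which controls how many attempts a task gets before its stage is aborted; note that raising it does not by itself let the job finish with some tasks permanently failed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Allow each task up to 8 attempts before the stage is aborted.
// Spark (as of 1.x) still fails the whole job once a single task
// exhausts its attempts; this only raises the attempt budget.
val conf = new SparkConf()
  .setAppName("task-failure-tolerance")
  .set("spark.task.maxFailures", "8")

val sc = new SparkContext(conf)
```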

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-02 Thread Nicolae Marasoiu
Hi, Set 10ms and spark.streaming.backpressure.enabled=true. This should automatically delay the next batch until the current one is processed, or at least strike that balance over a few batches/periods between the consume/process rate and the ingestion rate. Nicu
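A minimal sketch (assuming Spark Streaming 1.5+, where the backpressure controller was introduced) of setting both properties discussed in this thread; the batch interval and rate value are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap the per-partition ingestion rate and let the backpressure
// controller adapt the rate to the observed processing speed.
val conf = new SparkConf()
  .setAppName("kafka-direct-backpressure")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "10000") // records/sec per partition

val ssc = new StreamingContext(conf, Seconds(10))
```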

Re: How to save DataFrame as a Table in Hbase?

2015-10-02 Thread Nicolae Marasoiu
Hi, Phoenix, an SQL coprocessor for HBase, has ingestion integration with DataFrames in version 4.x. For HBase and RDDs in general there are multiple solutions: the hbase-spark module by Cloudera, which will be part of a future HBase release, hbase-rdd by Unicredit, and many others. I am not sure if
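A hedged sketch of the Phoenix 4.x ingestion path mentioned above, using the phoenix-spark plugin's DataFrame writer; the table name and ZooKeeper URL are placeholders for your environment:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: persist a DataFrame into an existing Phoenix table on HBase
// via the phoenix-spark plugin. "OUTPUT_TABLE" and the zkUrl are
// placeholders; the plugin requires SaveMode.Overwrite.
def saveToPhoenix(df: DataFrame): Unit = {
  df.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .option("table", "OUTPUT_TABLE")
    .option("zkUrl", "zk-host:2181")
    .save()
}
```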

Re: Kafka Direct Stream

2015-10-01 Thread Nicolae Marasoiu
Hi, If you just need per-topic processing, why not create N different Kafka direct streams? When creating a Kafka direct stream you pass a list of topics - just give it one. Then the reusable part of your computations should be extractable as transformations/functions and reused between the
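A sketch of the one-stream-per-topic idea, assuming the Spark 1.x spark-streaming-kafka API; the shared transformation logic would then be applied to each returned stream:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// One direct stream per topic, so each topic is processed independently.
// The reusable transformations can then be mapped over each stream.
def streamsPerTopic(ssc: StreamingContext,
                    kafkaParams: Map[String, String],
                    topics: Seq[String]) =
  topics.map { topic =>
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set(topic))
  }
```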

Re: Problem understanding spark word count execution

2015-10-01 Thread Nicolae Marasoiu
ons to give clarity on a similar thing. Nicu From: Kartik Mathur <kar...@bluedata.com> Sent: Thursday, October 1, 2015 10:25 PM To: Nicolae Marasoiu Cc: user Subject: Re: Problem understanding spark word count execution Thanks Nicolae, so in my case all executors

Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
part. From: Andy Dang <nam...@gmail.com> Sent: Wednesday, September 30, 2015 8:17 PM To: Nicolae Marasoiu Cc: user@spark.apache.org Subject: Re: sc.parallelize with defaultParallelism=1 Can't you just load the data from HBase first, and then call sc.parallelize on your datas

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
basically need is invalidating some portions of the data which have newer values. The "compute" method should be the same (read with TableInputFormat). Thanks Nicu ____ From: Nicolae Marasoiu <nicolae.maras...@adswizz.com> Sent: Wednesday, September

sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
Hi, When calling sc.parallelize(data, 1), is there a preference where to put the data? I see 2 possibilities: sending it to a worker node, or keeping it on the driver program. I would prefer to keep the data local to the driver. The use case is when I just need to load a bit of data from
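For reference, a tiny sketch of the call in question, assuming a small driver-local collection; the data lives on the driver until an action materializes the single partition on an executor:

```scala
import org.apache.spark.SparkContext

// Parallelize a small, driver-local collection into a single partition.
def smallRdd(sc: SparkContext, data: Seq[String]) =
  sc.parallelize(data, numSlices = 1)
```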

Re: partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
the latest data, and just compute on top of that if anything. Basically I guess I need to write my own RDD and implement the compute method by sliding over HBase. Thanks, Nicu From: Nicolae Marasoiu <nicolae.maras...@adswizz.com> Sent: Wednesday, September 30, 2015 3
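A hedged skeleton of the custom-RDD idea mentioned here: override getPartitions and compute so each partition re-reads its slice from HBase on demand. The actual scan (readSlice) is a placeholder, not a real HBase call:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that just remembers which slice of the table it covers.
case class SlicePartition(index: Int) extends Partition

// Each call to compute re-reads its slice, so newer HBase values are
// picked up on recomputation instead of being baked into the lineage.
class HBaseSliceRDD(sc: SparkContext, numSlices: Int)
  extends RDD[(String, Array[Byte])](sc, Nil) {

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => SlicePartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[(String, Array[Byte])] =
    readSlice(split.index)

  // Placeholder for a TableInputFormat-style scan of one slice.
  private def readSlice(idx: Int): Iterator[(String, Array[Byte])] = Iterator.empty
}
```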

partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi, If I implement a way to keep an up-to-date version of my RDD by ingesting some new events, called RDD_inc (from increment), and I provide a "merge" function m(RDD, RDD_inc) which returns RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs all the time,
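A hedged sketch of the m(RDD, RDD_inc) idea, simplified to a union plus a periodic checkpoint so the lineage does not grow without bound; the key/value types and checkpoint cadence are assumptions:

```scala
import org.apache.spark.rdd.RDD

// Merge the increment into the current state and occasionally checkpoint
// to truncate the lineage (requires sc.setCheckpointDir to have been set).
def merge[K, V](rdd: RDD[(K, V)], rddInc: RDD[(K, V)], generation: Long): RDD[(K, V)] = {
  val rddNew = rdd.union(rddInc)
  if (generation % 10 == 0) rddNew.checkpoint()
  rddNew
}
```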

Re: Problem understanding spark word count execution

2015-09-30 Thread Nicolae Marasoiu
Hi, 2- the end results are sent back to the driver; the shuffles are transmissions of intermediate results between nodes, such as the -> which are all intermediate transformations. More precisely, since flatMap and map are narrow dependencies, meaning they can usually happen on the local node,
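A word-count sketch matching this explanation: flatMap and map are narrow dependencies and run locally, reduceByKey introduces the shuffle, and collect() ships the final counts back to the driver:

```scala
import org.apache.spark.SparkContext

// flatMap/map run locally per partition; reduceByKey shuffles by key;
// collect() returns the final counts to the driver.
def wordCount(sc: SparkContext, path: String): Array[(String, Int)] =
  sc.textFile(path)
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .collect()
```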