Re: Failure handling

2017-01-25 Thread Erwan ALLAIN
On Tue, Jan 24, 2017 at 7:50 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> Hello guys,
> I have a question regarding how Spark handles failure.
> I'm using kafka direct stream
> Spark 2.0.2

Failure handling

2017-01-24 Thread Erwan ALLAIN
Hello guys, I have a question regarding how Spark handles failure. I'm using:
- Kafka direct stream
- Spark 2.0.2
- Kafka 0.10.0.1

Here is a snippet of code:

val stream = createDirectStream(…)
stream.map(…).foreachRDD(doSomething)
stream.map(…).foreachRDD(doSomethingElse)

The execution is in
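
A minimal sketch of the pattern above, assuming the spark-streaming-kafka-0-10 integration; the broker address, topic, group id, and the doSomething / doSomethingElse bodies are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object TwoOutputs {
  def doSomething(v: String): Unit = println(s"A: $v")      // placeholder
  def doSomethingElse(v: String): Unit = println(s"B: $v")  // placeholder

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("two-outputs"), Seconds(1))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("some-topic"), kafkaParams))

    // Two independent output operations on the same input stream:
    // each foreachRDD registers its own output, so every batch runs two jobs.
    stream.map(_.value).foreachRDD(_.foreach(doSomething))
    stream.map(_.value).foreachRDD(_.foreach(doSomethingElse))

    ssc.start()
    ssc.awaitTermination()
  }
}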

How to use logback

2016-11-28 Thread Erwan ALLAIN
Hello, in my project I would like to use logback as the logging framework (faster, smaller memory footprint, etc.). I have managed to make it work, however I had to modify the Spark jars folder:
- remove slf4j-log4jxx.jar
- add logback-classic / logback-core.jar
and add logback.xml in the conf folder. Is it
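
For reference, a minimal logback.xml of the kind described above; the appender and pattern are illustrative, not from the original message:

<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>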

Application config management

2016-11-09 Thread Erwan ALLAIN
Hi everyone, I'd like to know what kind of configuration mechanism is used in general. Below is what I'm going to implement, but I'd like to know if there is any "standard way":
1) put the configuration in HDFS
2) specify extraJavaOptions (driver and worker) with the HDFS URL (
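
A sketch of step 2 as a spark-submit invocation; the property name config.url, the namenode host/port, and the path are hypothetical:

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dconfig.url=hdfs://namenode:8020/conf/app.conf" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.url=hdfs://namenode:8020/conf/app.conf" \
  --class com.example.Main app.jar

The application can then read the property with sys.props("config.url") and fetch the file from HDFS itself.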

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
e to expire items from the map)

On Fri, Oct 21, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> Hi,
> I'm currently implementing an exactly once mechanism based on the following example:

Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
Hi, I'm currently implementing an exactly-once mechanism based on the following example:
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala
The pseudo-code is as follows:

dstream.transform( /* store offsets in a variable on the driver side */ )
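
A hedged sketch of that pseudo-code, assuming the kafka010 integration; saveResultsAndOffsetsTransactionally is a hypothetical helper standing in for the SQL transaction used in the linked example:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

def wireExactlyOnce(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
  // transform's body runs on the driver for each batch, so the offsets
  // of the current batch can be captured in a driver-side variable
  var offsetRanges = Array.empty[OffsetRange]

  stream
    .transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }
    .map(_.value)
    .foreachRDD { rdd =>
      // write results and offsets in ONE transaction, so a replayed batch
      // either fully commits or is fully rolled back (exactly-once)
      saveResultsAndOffsetsTransactionally(rdd.collect(), offsetRanges)
    }
}

// hypothetical sink; a real one would use a database transaction
def saveResultsAndOffsetsTransactionally(rs: Array[String], or: Array[OffsetRange]): Unit = ???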

Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Erwan ALLAIN
Hi, I'm working with:
- Kafka 0.8.2
- Spark Streaming (2.0) direct input stream
- Cassandra 3.0

My batch interval is 1 s. When I use some map, filter, or even saveToCassandra functions, the processing time is around 50 ms on empty batches => this is fine. As soon as I use some reduceByKey, the
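
A sketch of the kind of pipeline described, with placeholder logic; the narrow transformations (map, filter) stay cheap on empty batches, while reduceByKey introduces a shuffle stage whose reduce tasks are scheduled every batch interval, empty or not:

import org.apache.spark.streaming.dstream.DStream

def wire(stream: DStream[(String, String)]): Unit = {
  stream
    .map { case (k, _) => (k, 1L) }       // narrow: ~free on an empty batch
    .filter { case (k, _) => k != null }  // narrow: ~free on an empty batch
    .reduceByKey(_ + _)                   // shuffle stage runs each batch, empty or not
    .foreachRDD(rdd => rdd.take(10).foreach(println))
}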

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
ers (without splitbrain problem, etc). I also don't know how well ZK will work cross-datacenter.

As far as the Spark side of things goes, if it's idempotent, why not just run both instances all the time.

On Tue, Apr 19, 2016 at 10:21 AM, Erwan ALLAIN <eal

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
ninger.org> wrote:
> The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions.
> https://issues.apache.org/jira/browse/SPARK-12177 has the

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> Hello,
> I'm currently designing a solution where 2 distinct Spark clusters (2 datacenters) share the same Kafka (Kafka rack aware or manual broker

Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-18 Thread Erwan ALLAIN
Hello, I'm currently designing a solution where 2 distinct Spark clusters (2 datacenters) share the same Kafka (Kafka rack awareness or manual broker repartition). The aims are:
- tolerating a DC crash: using Kafka resiliency and the consumer group mechanism (or something else?)
- keeping consistent offsets among

Re: Join and HashPartitioner question

2015-11-16 Thread Erwan ALLAIN
You may need to persist r1 after the partitionBy call; the second join will then be more efficient.

On Mon, Nov 16, 2015 at 2:48 PM, Rishi Mishra wrote:
> AFAIK, and as I can see in the code, both of them should behave the same.
> On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov
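
A short sketch of the suggestion with placeholder RDDs; persisting right after partitionBy keeps the shuffled layout cached, so r1 is shuffled once instead of once per join:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def joinTwice(r1: RDD[(String, Int)],
              a: RDD[(String, String)],
              b: RDD[(String, String)]): Unit = {
  // shuffle r1 once, then cache the partitioned blocks
  val partitioned = r1.partitionBy(new HashPartitioner(100))
    .persist(StorageLevel.MEMORY_AND_DISK)

  // both joins see partitioned's partitioner, so only a and b get shuffled
  partitioned.join(a).count() // materializes the cache
  partitioned.join(b).count() // reuses it: r1 is not recomputed or re-shuffled
}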

Re: Saving offset while reading from kafka

2015-10-23 Thread Erwan ALLAIN
Have a look at this: https://github.com/koeninger/kafka-exactly-once especially: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala

Re: Best practices to handle corrupted records

2015-10-16 Thread Erwan ALLAIN
ready looked into it and also at 'Try', from which I got inspired. Thanks for pointing it out anyway!

#A.M.

On 15 Oct 2015, at 16:19, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> What about http://www.scala-lang.org/api/2.9.3/scala/Either.ht

Re: Best practices to handle corrupted records

2015-10-15 Thread Erwan ALLAIN
What about http://www.scala-lang.org/api/2.9.3/scala/Either.html ?

On Thu, Oct 15, 2015 at 2:57 PM, Roberto Congiu wrote:
> I came to a similar solution to a similar problem. I deal with a lot of CSV files from many different sources and they are often malformed.
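
A small sketch of the Either suggestion; the record shape (two CSV columns, second numeric) and the parsing rules are hypothetical:

import org.apache.spark.rdd.RDD

object CorruptRecords {
  // route malformed lines to Left, good records to Right
  def parse(line: String): Either[String, (String, Int)] = {
    val cols = line.split(",", -1)
    if (cols.length == 2 && cols(1).nonEmpty && cols(1).forall(_.isDigit))
      Right((cols(0), cols(1).toInt))
    else
      Left(line) // keep the raw line for a quarantine output
  }

  def split(rawLines: RDD[String]): (RDD[(String, Int)], RDD[String]) = {
    val parsed = rawLines.map(parse)
    val good = parsed.collect { case Right(rec) => rec }
    val bad  = parsed.collect { case Left(raw)  => raw } // e.g. save for later inspection
    (good, bad)
  }
}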

Re: Can KafkaCluster be public?

2015-10-07 Thread Erwan ALLAIN
> You can put a class in the org.apache.spark namespace to access anything that is private[spark]. You can then make enrichments there to access whatever you need. Just beware upgrade pain :)
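
A sketch of the workaround being suggested (user code relying on Spark internals, not a public API): a file compiled in your own project but declared in Spark's package, so that private[spark] members such as the 0.8 integration's KafkaCluster become reachable. As the reply warns, this can break on any Spark upgrade:

package org.apache.spark.streaming.kafka

// lives in our codebase, but in Spark's package, so it can see private[spark]
object KafkaClusterAccessor {
  def create(kafkaParams: Map[String, String]): KafkaCluster =
    new KafkaCluster(kafkaParams)
}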

Can KafkaCluster be public?

2015-10-06 Thread Erwan ALLAIN
the same as the KafkaCluster, which is private. Is it possible to:
- add another signature in KafkaUtils?
- make KafkaCluster public?
Or do you have any other smart solution where I don't need to copy/paste KafkaCluster? Thanks. Regards, Erwan ALLAIN