RE: carbondata and idempotence

2016-09-27 Thread vincent
Hi,
thanks for your answer. My question is about both streaming and batch. Even
in batch, if a worker crashes or if speculation is enabled, the worker's task
will be relaunched on another worker. For example, if the worker crashes
after having ingested 20 000 of the 100 000 lines of its task, the new worker
will write all 100 000 lines again, resulting in 20 000 duplicated entries in
the storage layer.
This issue is generally handled with a primary key or with transactions:
either the new task overwrites the 20 000 lines already written, or the
transaction holding the first 20 000 lines is rolled back.
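
To make the scenario concrete, here is a minimal sketch in plain Python. It is only a toy simulation: it does not use any CarbonData or Spark API, and the names (run_task, run_with_retry, the three sinks) are hypothetical, chosen for illustration. It compares an append-only sink with a sink keyed by primary key and a sink that commits only completed attempts.

```python
# Toy simulation only: plain Python containers stand in for the storage
# layer. Nothing here is a CarbonData or Spark API.

class WorkerCrash(Exception):
    """Raised to simulate a worker dying in the middle of a task."""


def run_task(rows, write, crash_after=None):
    """Write rows one by one; crash once `crash_after` rows are written."""
    for i, row in enumerate(rows):
        if crash_after is not None and i == crash_after:
            raise WorkerCrash(f"worker died after {i} rows")
        write(row)


def run_with_retry(rows, write, crash_after):
    """The first attempt crashes part-way, the relaunched attempt completes."""
    try:
        run_task(rows, write, crash_after=crash_after)
    except WorkerCrash:
        run_task(rows, write)  # task relaunched on another worker


rows = [(key, f"value-{key}") for key in range(100_000)]

# 1) Append-only sink: the retry re-appends the rows already written.
append_sink = []
run_with_retry(rows, append_sink.append, crash_after=20_000)

# 2) Primary-key sink: the retry overwrites the same keys (upsert).
keyed_sink = {}


def upsert(row):
    key, value = row
    keyed_sink[key] = value


run_with_retry(rows, upsert, crash_after=20_000)

# 3) Transactional sink: each attempt stages its rows, and only a
#    completed attempt is committed; a crashed attempt is discarded.
committed = []


def run_transactional(rows, crash_after):
    for attempt_crash in (crash_after, None):
        staged = []
        try:
            run_task(rows, staged.append, crash_after=attempt_crash)
        except WorkerCrash:
            continue  # rollback: the staged rows are simply dropped
        committed.extend(staged)  # commit the completed attempt
        return


run_transactional(rows, crash_after=20_000)

print(len(append_sink))  # 120000: 20 000 duplicates
print(len(keyed_sink))   # 100000
print(len(committed))    # 100000
```

Only the append-only sink ends up with the 20 000 extra lines; the other two are idempotent with respect to a relaunched task, which is the behaviour I am looking for.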





carbondata and idempotence

2016-09-23 Thread vincent gromakowski
Hi CarbonData community,
I am evaluating various file formats right now and found CarbonData
interesting, especially the multiple indexes used to avoid full scans, but I
am wondering whether there is any way to achieve idempotence when writing to
CarbonData from Spark (or an alternative)?
A strong requirement is that a Spark worker crash must not write duplicated
entries to Carbon...
Thanks

Vincent