OK, got it. Thanks.

2015-04-15 0:50 GMT+08:00 Gwen Shapira <[email protected]>:
> Flume is an at-least-once system. This means we will never lose data, but
> you may get duplicate events on errors.
> In the cases you pointed out - the events were written but we still
> BACKOFF - you will get duplicate events in the channel or in HDFS.
>
> You probably want to write a small script to de-duplicate the data in
> HDFS, like we do in this example:
>
> https://github.com/hadooparchitecturebook/clickstream-tutorial/blob/master/03_processing/01_dedup/pig/dedup.pig
>
> Gwen
>
> On Tue, Apr 14, 2015 at 9:17 AM, Tao Li <[email protected]> wrote:
> > Hi all:
> >
> > I have a question about "Transaction". For example, the KafkaSource code
> > looks like this:
> >
> > try {
> >     getChannelProcessor().processEventBatch(eventList);
> >     consumer.commitOffsets();
> >     return Status.READY;
> > } catch (Exception e) {
> >     return Status.BACKOFF;
> > }
> >
> > If processEventBatch() succeeds but commitOffsets() fails, it will return
> > BACKOFF. But the eventList has already been written to the channel.
> >
> > ----------------------------------
> >
> > Also, the HDFSEventSink code looks like this:
> >
> > try {
> >     bucketWriter.append(event);
> >     bucketWriter.flush();
> >     transaction.commit();
> >     return Status.READY;
> > } catch (Exception e) {
> >     transaction.rollback();
> >     return Status.BACKOFF;
> > }
> >
> > If bucketWriter.flush() succeeds but transaction.commit() fails, it will
> > call transaction.rollback() and return BACKOFF. But the event has already
> > been flushed to HDFS.
> >
> > 2015-04-15 0:09 GMT+08:00 Tao Li <[email protected]>:
> >>
> >> Hi all:
> >>
> >> I have a question about "Transaction". For example, KafkaSource code
> >> like this:
> >>
> >> try {
> >>     getChannelProcessor().processEventBatch(eventList);
> >>     consumer.commitOffsets();
> >> }
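
To make the duplicate window concrete, here is the KafkaSource-style loop
from the question with comments marking where duplicates can arise. The names
(getChannelProcessor, consumer, Status) follow the snippet quoted above; this
is an illustrative sketch, not the verbatim Flume source.

    try {
        // Step 1: commit the batch to the channel. Once this returns, the
        // channel (and any downstream sink) owns the events.
        getChannelProcessor().processEventBatch(eventList);

        // Step 2: only now advance the Kafka offset. If the process dies or
        // commitOffsets() throws between steps 1 and 2, the offset stays
        // put, so the same batch is re-fetched and re-written on the next
        // poll: duplicates, but never loss. Committing in the opposite
        // order would risk loss instead, which is why the source is
        // at-least-once.
        consumer.commitOffsets();

        return Status.READY;
    } catch (Exception e) {
        // BACKOFF only delays the retry; it cannot undo step 1, so any
        // events already committed to the channel stay there.
        return Status.BACKOFF;
    }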
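
The HDFSEventSink case has the same shape on the sink side: flush() makes the
data durable in HDFS before the channel transaction commits, and a rollback
cannot un-flush it. Again a commented sketch using the names from the quoted
snippet, not the actual source:

    try {
        bucketWriter.append(event);
        bucketWriter.flush();    // the event is durable in HDFS from here on
        transaction.commit();    // if this throws, we fall into the catch...
        return Status.READY;
    } catch (Exception e) {
        transaction.rollback();  // ...but rollback only returns the event to
        return Status.BACKOFF;   // the channel; the retry writes it to HDFS
    }                            // a second time: a duplicate, not a loss.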
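
The dedup job Gwen links to is written in Pig. As a minimal sketch of the
same idea in Java - assuming each record begins with a unique event id, a
hypothetical key the producer would have to stamp on, since Flume does not
generate one itself:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// De-duplicate "<eventId>\t<payload>" lines by keeping the first copy of
// each id. The eventId field is an assumption: some producer-side unique
// key is needed for any dedup pass to work.
public class Dedup {
    public static List<String> dedup(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String id = line.split("\t", 2)[0]; // unique key = first field
            if (seen.add(id)) {                 // add() is false on repeats
                out.add(line);
            }
        }
        return out;
    }
}

Which copy survives does not matter here: duplicates produced by these
retries are byte-identical re-deliveries of the same event, so any one of
them will do.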
