Hi there,

Our system generates many small files in Avro format, all with the same schema,
and sends them to Flume via Thrift RPC.
Our Flume agent has the following configuration:

agent.channels=ch1
agent.sources=thrift-source1
agent.sinks=s3-sink1
agent.channels.ch1.type=file
agent.channels.ch1.checkpointDir=/flume/ch1/checkpoint
agent.channels.ch1.dataDirs=/flume/ch1/data
agent.sources.thrift-source1.channels=ch1
agent.sources.thrift-source1.type=thrift
agent.sources.thrift-source1.bind=0.0.0.0
agent.sources.thrift-source1.threads=5
agent.sources.thrift-source1.port=1026
agent.sinks.s3-sink1.channel=ch1
agent.sinks.s3-sink1.type=hdfs
agent.sinks.s3-sink1.hdfs.path=s3n://bucket/path/
agent.sinks.s3-sink1.hdfs.filePrefix=documents
agent.sinks.s3-sink1.hdfs.fileSuffix=.avro
agent.sinks.s3-sink1.hdfs.rollInterval=0
agent.sinks.s3-sink1.hdfs.rollSize=20971520
agent.sinks.s3-sink1.hdfs.rollCount=0
agent.sinks.s3-sink1.hdfs.batchSize=10
agent.sinks.s3-sink1.hdfs.fileType=DataStream
agent.sinks.s3-sink1.hdfs.useLocalTimeStamp=true

Currently Flume just concatenates all the Avro files into a single file, so I 
end up with one big file in which the schema and other Avro-specific metadata 
are written multiple times.
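To illustrate the problem, here is a minimal Python sketch (the payload bytes are placeholders, not real Avro data). Per the Avro spec, every container file starts with the 4-byte magic `Obj\x01` followed by file metadata (including the full schema) and a 16-byte sync marker, so naively concatenating N files embeds the header N times:

```python
# Every Avro container file begins with the magic bytes b"Obj\x01",
# followed by a metadata map (avro.schema, avro.codec) and a sync marker.
AVRO_MAGIC = b"Obj\x01"

def count_headers(blob: bytes) -> int:
    """Count Avro file headers (i.e., embedded schemas) in a byte blob."""
    count, pos = 0, 0
    while True:
        pos = blob.find(AVRO_MAGIC, pos)
        if pos == -1:
            return count
        count += 1
        pos += len(AVRO_MAGIC)

# Simulate three small Avro files concatenated by a pass-through sink.
# "<metadata+schema><sync><data>" stands in for the real header and blocks.
fake_file = AVRO_MAGIC + b"<metadata+schema><sync><data>"
concatenated = fake_file * 3
print(count_headers(concatenated))  # -> 3; a valid container file has 1
```

A reader following the spec stops at the end of the first embedded file, so the concatenated output is not a valid single container file.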
How can I configure Flume to generate a valid Avro container file, where the 
schema is written once and which contains the Avro datums (without per-file 
metadata) from all the small files? The schema is the same for all files.

Thanks,
Andrei.
