Hi there,

Our system generates a lot of small files in Avro format with the same schema and sends them to Flume via Thrift RPC. Our Flume agent has the following configuration:
agent.channels=ch1
agent.sources=thrift-source1
agent.sinks=s3-sink1

agent.channels.ch1.type=file
agent.channels.ch1.checkpointDir=/flume/ch1/checkpoint
agent.channels.ch1.dataDirs=/flume/ch1/data

agent.sources.thrift-source1.channels=ch1
agent.sources.thrift-source1.type=thrift
agent.sources.thrift-source1.bind=0.0.0.0
agent.sources.thrift-source1.threads=5
agent.sources.thrift-source1.port=1026

agent.sinks.s3-sink1.channel=ch1
agent.sinks.s3-sink1.type=hdfs
agent.sinks.s3-sink1.hdfs.path=s3n://bucket/path/
agent.sinks.s3-sink1.hdfs.filePrefix=documents
agent.sinks.s3-sink1.hdfs.fileSuffix=.avro
agent.sinks.s3-sink1.hdfs.rollInterval=0
agent.sinks.s3-sink1.hdfs.rollSize=20971520
agent.sinks.s3-sink1.hdfs.rollCount=0
agent.sinks.s3-sink1.hdfs.batchSize=10
agent.sinks.s3-sink1.hdfs.fileType=DataStream
agent.sinks.s3-sink1.hdfs.useLocalTimeStamp=true

Currently Flume just concatenates all the Avro files into a single file, so I end up with one big file in which the schema and other Avro-specific metadata are written multiple times. How can I configure Flume to produce a valid Avro container file, where the schema is written once and which contains the Avro datums (without per-file metadata) from all the small files? The schema is the same for all of them.

Thanks,
Andrei.
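P.S. To illustrate what I mean by a "valid Avro container file": a container stores the schema exactly once, in a header (magic bytes, metadata map, 16-byte sync marker), followed by data blocks that each end with that sync marker, so naive concatenation repeats the header. The sketch below is my own illustration, not Flume code; it assumes the null (uncompressed) codec and in-memory byte strings, and merges several containers at the block level, keeping only the first header:

```python
# Illustrative stdlib-only sketch of merging Avro container files that share
# a schema, by copying data blocks and rewriting sync markers. Assumes the
# "null" codec; function names are my own, not from any Avro library.
import io

MAGIC = b"Obj\x01"  # Avro object container file magic

def read_long(fo):
    """Decode an Avro zigzag-varint long from a binary stream."""
    b = fo.read(1)[0]          # raises IndexError at end of stream
    n, shift = b & 0x7F, 7
    while b & 0x80:
        b = fo.read(1)[0]
        n |= (b & 0x7F) << shift
        shift += 7
    return (n >> 1) ^ -(n & 1)

def write_long(n):
    """Encode a long as an Avro zigzag varint."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def read_header(fo):
    """Consume magic + metadata map; return the file's 16-byte sync marker."""
    assert fo.read(4) == MAGIC, "not an Avro container file"
    while True:
        count = read_long(fo)
        if count == 0:
            break
        if count < 0:          # negative count: a byte size precedes the items
            read_long(fo)
            count = -count
        for _ in range(count): # skip key (string) and value (bytes)
            fo.read(read_long(fo))
            fo.read(read_long(fo))
    return fo.read(16)

def merge_containers(blobs):
    """Merge container byte strings (same schema, null codec) into one
    container whose header -- and thus schema -- appears only once."""
    out, first_sync = bytearray(), None
    for blob in blobs:
        fo = io.BytesIO(blob)
        sync = read_header(fo)
        if first_sync is None:
            out += blob[:fo.tell()]   # keep the first header verbatim
            first_sync = sync
        while True:                    # copy blocks: count, size, data, sync
            try:
                count = read_long(fo)
            except IndexError:
                break
            size = read_long(fo)
            out += write_long(count) + write_long(size) + fo.read(size)
            fo.read(16)                # drop this file's sync marker...
            out += first_sync          # ...and substitute the first file's
    return bytes(out)
```

This is roughly what Avro's Java `DataFileWriter.appendAllFrom` does; in practice I would prefer that over hand-rolled parsing, but the sketch shows why the schema only needs to be written once.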
