This is something of an anti-pattern for Flume, though it is possible.
You need to set maxBlobLength to something larger than your largest file. You also need a custom serializer (org.apache.flume.serialization.EventSerializer$Builder) to keep the files binary. An easier solution would be to use Apache NiFi (incubating), which is designed for file-based data flow and has support for writing binary files to HDFS. (A configuration sketch covering both agents follows the quoted message below.)

-Joey

On Thu, Jan 15, 2015 at 2:40 AM, Riccardo Carè <[email protected]> wrote:
> Hello,
>
> I am new to Flume and I am trying to experiment with it by moving binary
> files across two agents.
>
> - The first agent runs on machine A and uses a spooldir source and a
>   thrift sink.
> - The second agent runs on machine B, which is part of a Hadoop cluster.
>   It has a thrift source and an HDFS sink.
>
> I have two questions about this configuration:
> - I know I have to use the BlobDeserializer$Builder for the source on A,
>   but what is the correct value for the maxBlobLength parameter? Should it
>   be less than or greater than the expected size of the binary file?
> - I did some tests and found that the transmitted file was corrupted on
>   HDFS. I think this was caused by the HDFS sink, which uses TEXT as the
>   default serializer (I assume it is writing \n characters between one
>   event and the next). How can I fix this?
>
> Thank you very much in advance.
>
> Best regards,
> Riccardo

--
Joey Echeverria
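[Editor's note] A minimal configuration sketch that maps Joey's pointers onto the two agents Riccardo describes. The property names (spoolDir, deserializer, deserializer.maxBlobLength, hdfs.fileType, serializer, hdfs.roll*) come from the Flume user guide; the agent and component names, hosts, ports, paths, the 500 MB figure, and the com.example.flume.RawBodySerializer class are placeholders, and that serializer is a hypothetical class you would implement against org.apache.flume.serialization.EventSerializer.Builder as Joey suggests.

    # Agent A (machine A): spooldir source -> thrift sink
    a1.sources = src1
    a1.channels = ch1
    a1.sinks = snk1

    a1.sources.src1.type = spooldir
    a1.sources.src1.channels = ch1
    a1.sources.src1.spoolDir = /var/spool/flume/binfiles
    # Emit each spooled file as a single event whose body is the raw bytes
    a1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
    # Must be larger than the biggest file you expect, in bytes (500 MB here as an example)
    a1.sources.src1.deserializer.maxBlobLength = 500000000

    a1.channels.ch1.type = file

    a1.sinks.snk1.type = thrift
    a1.sinks.snk1.channel = ch1
    a1.sinks.snk1.hostname = machineB.example.com
    a1.sinks.snk1.port = 4545

    # Agent B (machine B): thrift source -> HDFS sink
    a2.sources = src1
    a2.channels = ch1
    a2.sinks = snk1

    a2.sources.src1.type = thrift
    a2.sources.src1.channels = ch1
    a2.sources.src1.bind = 0.0.0.0
    a2.sources.src1.port = 4545

    a2.channels.ch1.type = file

    a2.sinks.snk1.type = hdfs
    a2.sinks.snk1.channel = ch1
    a2.sinks.snk1.hdfs.path = hdfs://namenode/flume/binfiles
    # Write event bodies as a raw stream rather than the default SequenceFile
    a2.sinks.snk1.hdfs.fileType = DataStream
    # Hypothetical binary-safe serializer that writes the body bytes without
    # appending newlines (implements org.apache.flume.serialization.EventSerializer.Builder)
    a2.sinks.snk1.serializer = com.example.flume.RawBodySerializer$Builder
    # Optional: roll after every event so each source file maps to one HDFS file
    a2.sinks.snk1.hdfs.rollCount = 1
    a2.sinks.snk1.hdfs.rollSize = 0
    a2.sinks.snk1.hdfs.rollInterval = 0

In this sketch, deserializer.maxBlobLength addresses Riccardo's first question (it is a byte limit and must be greater than the largest expected file), while hdfs.fileType = DataStream together with a newline-free serializer is one way to address the corruption he attributes to the default TEXT serializer.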
