Essentially we are instrumenting distributed applications. The instrumented message format is defined in an Avro schema. The messages are transported over a message queue (e.g., RabbitMQ) or, eventually, over Flume, dumped into HDFS, and then loaded into Hive for querying.
In HDFS we can certainly colocate the data into a small number of files. But I want to know if we can minimize network bandwidth by generating valid messages on the client side, without the schema in the header. Does that make sense?

Shaq

On Mon, Mar 17, 2014 at 4:17 PM, Sean Busbey <busbey+li...@cloudera.com> wrote:

> Hi Shaq!
>
> Could you describe your use case in more detail?
>
> Generally, HDFS will behave poorly in the face of many small files. Could
> you perhaps colocate several data in one file? This will help both with the
> relative overhead of the schema and the pressure on the HDFS NameNode.
>
> -Sean
>
>
> On Mon, Mar 17, 2014 at 2:55 PM, Salman Haq <shaq....@audaxhealth.com> wrote:
>
>> Hello,
>>
>> I'd like to confirm if there is a recommended way to serialize data to a
>> file but without the schema being written in the file metadata. Assume a
>> reader's schema will be available for deserialization at a later time.
>>
>> My use case requires small-sized datum messages to be serialized and
>> copied to HDFS. The presence of the schema in the message file adds
>> considerable overhead relative to the size of the datum itself.
>>
>> Thank you,
>> Shaq
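To illustrate the size at stake: Avro's raw binary encoding of a datum (what a DatumWriter emits through a BinaryEncoder, as opposed to an Object Container File with its schema-bearing header) contains no schema at all. Below is a minimal pure-Python sketch of that encoding for a hypothetical record with fields `{"id": long, "msg": string}` — the record shape and field values are assumptions for illustration, not Shaq's actual schema:

```python
def zigzag(n: int) -> int:
    # Avro maps signed longs to unsigned via zigzag:
    # 0, -1, 1, -2, 2, ...  ->  0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    # Variable-length base-128 encoding of the zigzagged value,
    # least-significant group first, high bit set on all but the last byte.
    z = zigzag(n)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_string(s: str) -> bytes:
    # Avro string: long byte-length prefix, then the UTF-8 bytes.
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_record(rec_id: int, msg: str) -> bytes:
    # An Avro record is just its fields concatenated in schema order --
    # no field names, no schema, no header bytes at all.
    return encode_long(rec_id) + encode_string(msg)

payload = encode_record(42, "latency=3ms")
print(len(payload))  # prints 13 -- the whole datum fits in a handful of bytes
```

The trade-off is exactly what Shaq describes: the bytes are only decodable by a reader that obtains the writer's schema out of band, so the schema must be agreed on (or versioned) between producer and consumer before deserialization.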