Essentially, we are instrumenting distributed applications. The instrumented
message format is defined in an Avro schema. The messages are transported
over a message queue (e.g., RabbitMQ) or (eventually) over Flume and dumped
into HDFS, from where they are loaded into Hive for querying.

In HDFS we can certainly colocate the data into a small number of files.
But I want to know whether we can reduce network bandwidth by generating
valid messages on the client side but without the schema in the header.

Does that make sense?

Shaq


On Mon, Mar 17, 2014 at 4:17 PM, Sean Busbey <busbey+li...@cloudera.com> wrote:

> Hi Shaq!
>
> Could you describe your use case in more detail?
>
> Generally, HDFS will behave poorly in the face of many small files. Could
> you perhaps colocate several records in one file? This will help both with
> the relative overhead of the schema and the pressure on the HDFS NameNode.
>
> -Sean
>
>
> On Mon, Mar 17, 2014 at 2:55 PM, Salman Haq <shaq....@audaxhealth.com> wrote:
>
>> Hello,
>>
>> I'd like to confirm whether there is a recommended way to serialize data
>> to a file without the schema being written in the file metadata. Assume a
>> reader's schema will be available for deserialization at a later time.
>>
>> My use case requires small-sized datum messages to be serialized and
>> copied to HDFS. The presence of the schema in the message file adds
>> considerable overhead relative to the size of the datum itself.
>>
>> Thank you,
>> Shaq
>>
>>
>
