You can also use fluentd. http://fluentd.org/
"Fluentd receives logs as JSON streams, buffers them, and sends them
to other systems like Amazon S3, MongoDB, Hadoop, or other Fluentds."
It has a plugin for pushing into HDFS through fluent-plugin-webhdfs.
https://github.com/fluent/fluent-plugin-webhdfs
It can also handle JSON directly, so it fits your use case.
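
For example, a minimal td-agent configuration along these lines (the tag
pattern, host, port, and path are placeholders for your setup) receives
JSON events over the forward protocol and writes them to HDFS through
WebHDFS:

  <source>
    type forward
    port 24224
  </source>

  <match app.json.**>
    type webhdfs
    host namenode.example.com
    port 50070
    path /log/json/%Y%m%d/data.${hostname}.log
    flush_interval 60s
  </match>

The buffering (flush_interval above) batches many small JSON records
into fewer, larger writes, which HDFS handles much better.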

Thanks,
Tsuyoshi

On Fri, Jan 11, 2013 at 10:03 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
> There is also kafka. http://kafka.apache.org
> "A high-throughput, distributed, publish-subscribe messaging system."
>
> But it does not push data into HDFS; you need to launch a job to pull the data in.
>
> Regards
>
> Bertrand
>
>
> On Fri, Jan 11, 2013 at 1:52 PM, Mirko Kämpf <mirko.kae...@gmail.com> wrote:
>>
>> I would suggest working with Flume, in order to collect a certain number
>> of files and store them to HDFS in larger chunks, or to write them directly
>> to HBase. HBase allows random access later on (if needed); otherwise it
>> could be overkill. You could also collect the data in a MySQL DB and then
>> import it regularly via Sqoop.
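>>
>> For the Flume route, a minimal flume-ng agent along these lines (all
>> names, paths and sizes are placeholders) would spool locally collected
>> JSON files and roll them into larger files on HDFS:
>>
>>   agent.sources  = jsonSpool
>>   agent.channels = fileCh
>>   agent.sinks    = hdfsSink
>>
>>   agent.sources.jsonSpool.type     = spooldir
>>   agent.sources.jsonSpool.spoolDir = /var/spool/json
>>   agent.sources.jsonSpool.channels = fileCh
>>
>>   agent.channels.fileCh.type = file
>>
>>   agent.sinks.hdfsSink.type              = hdfs
>>   agent.sinks.hdfsSink.channel           = fileCh
>>   agent.sinks.hdfsSink.hdfs.path         = hdfs://namenode/data/json
>>   agent.sinks.hdfsSink.hdfs.fileType     = DataStream
>>   agent.sinks.hdfsSink.hdfs.rollSize     = 134217728
>>   agent.sinks.hdfsSink.hdfs.rollCount    = 0
>>   agent.sinks.hdfsSink.hdfs.rollInterval = 0
>>
>> so files are rolled by size (~128 MB here) rather than per event or by
>> time. The MySQL variant would then just be a periodic
>>
>>   sqoop import --connect jdbc:mysql://dbhost/logs --table events \
>>     --target-dir /data/json/events
>>
>> (connection string, table and target dir are again placeholders).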
>>
>> Best
>> Mirko
>>
>>
>> "Every dat flow goes to Hadoop"
>> citation from an unkown source
>>
>> 2013/1/11 Hemanth Yamijala <yhema...@thoughtworks.com>
>>>
>>> Queues in the capacity scheduler are logical data structures into which
>>> MapReduce jobs are placed to be picked up by the JobTracker / Scheduler
>>> framework, according to some capacity constraints that can be defined for a
>>> queue.
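>>>
>>> For reference, in Hadoop 1.x a queue and its capacity are defined
>>> roughly like this (the queue name and percentage are just examples):
>>>
>>>   # mapred-site.xml
>>>   mapred.queue.names = default,ingest
>>>
>>>   # capacity-scheduler.xml
>>>   mapred.capacity-scheduler.queue.ingest.capacity = 30
>>>
>>> and a job is sent to it with -Dmapred.job.queue.name=ingest.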
>>>
>>> So, given your use case, I don't think the Capacity Scheduler is going to
>>> directly help you (since you only spoke about data-in, and not processing).
>>>
>>> So yes, something like Flume or Scribe.
>>>
>>> Thanks
>>> Hemanth
>>>
>>> On Fri, Jan 11, 2013 at 11:34 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>
>>>> Your question is unclear: HDFS has no queues for ingesting data (it is
>>>> a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
>>>> components have queues for data-processing purposes.
>>>>
>>>> On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ouchwhis...@gmail.com>
>>>> wrote:
>>>> > Hello,
>>>> >
>>>> > I have a Hadoop cluster setup of 10 nodes and I am in need of
>>>> > implementing
>>>> > queues in the cluster for receiving high volumes of data.
>>>> > Please suggest what would be more efficient to use in the case of
>>>> > receiving
>>>> > 24 million JSON files (approx. 5 KB each) every 24 hours:
>>>> > 1. Using Capacity Scheduler
>>>> > 2. Implementing RabbitMQ and receiving data from it using Spring
>>>> > Integration
>>>> > data pipelines.
>>>> >
>>>> > I cannot afford to lose any of the JSON files received.
>>>> >
>>>> > Thanking You,
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Ouch Whisper
>>>> > 010101010101
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>
>>>
>>
>
>
>
> --
> Bertrand Dechoux



-- 
OZAWA Tsuyoshi
