You can also use fluentd: http://fluentd.org/

"Fluentd receives logs as JSON streams, buffers them, and sends them to
other systems like Amazon S3, MongoDB, Hadoop, or other Fluentds."

It has a plugin, fluent-plugin-webhdfs, for pushing data into HDFS:
https://github.com/fluent/fluent-plugin-webhdfs

It also handles JSON natively, so it fits your case well.
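As a rough illustration, a td-agent configuration along these lines could
work. This is only a minimal sketch: the file paths, tag, and NameNode host
below are placeholders, and it assumes the webhdfs plugin is installed and
WebHDFS is enabled on the NameNode (dfs.webhdfs.enabled=true in
hdfs-site.xml).

  <source>
    type tail
    format json
    path /var/log/app/events.log        # placeholder: incoming JSON records, one per line
    pos_file /var/log/td-agent/events.pos
    tag hdfs.events
  </source>

  <match hdfs.**>
    type webhdfs
    host namenode.example.com           # placeholder: your NameNode host
    port 50070                          # default WebHDFS port
    path /log/events/%Y%m%d_%H.log      # files are rolled per hour
    flush_interval 10s
  </match>

Since fluentd buffers before flushing, you also get the "collect small
files into larger chunks" behavior mentioned below, instead of writing 24
million tiny files to HDFS.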
Thanks,
Tsuyoshi

On Fri, Jan 11, 2013 at 10:03 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
> There is also Kafka. http://kafka.apache.org
> "A high-throughput, distributed, publish-subscribe messaging system."
>
> But it does not push into HDFS; you need to launch a job to pull the data in.
>
> Regards
>
> Bertrand
>
>
> On Fri, Jan 11, 2013 at 1:52 PM, Mirko Kämpf <mirko.kae...@gmail.com> wrote:
>>
>> I would suggest working with Flume, in order to collect a certain number
>> of files and store them to HDFS in larger chunks, or to write them
>> directly to HBase, which allows random access later on (if needed);
>> otherwise HBase could be overkill. You can also collect data in a MySQL
>> DB and then import it regularly via Sqoop.
>>
>> Best
>> Mirko
>>
>>
>> "Every data flow goes to Hadoop"
>> citation from an unknown source
>>
>> 2013/1/11 Hemanth Yamijala <yhema...@thoughtworks.com>
>>>
>>> Queues in the Capacity Scheduler are logical data structures into which
>>> MapReduce jobs are placed to be picked up by the JobTracker / scheduler
>>> framework, according to capacity constraints that can be defined for a
>>> queue.
>>>
>>> So, given your use case, I don't think the Capacity Scheduler is going
>>> to directly help you (since you only spoke about data-in, and not
>>> processing).
>>>
>>> So yes, something like Flume or Scribe would fit.
>>>
>>> Thanks
>>> Hemanth
>>>
>>> On Fri, Jan 11, 2013 at 11:34 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>
>>>> Your question is unclear: HDFS has no queues for ingesting data (it is
>>>> a simple, distributed file system). The Hadoop M/R and Hadoop YARN
>>>> components have queues for data-processing purposes.
>>>>
>>>> On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper <ouchwhis...@gmail.com>
>>>> wrote:
>>>> > Hello,
>>>> >
>>>> > I have a Hadoop cluster setup of 10 nodes and I am in need of
>>>> > implementing queues in the cluster for receiving high volumes of data.
>>>> > Please suggest what will be more efficient to use in the case of
>>>> > receiving 24 million JSON files (approx. 5 KB each) every 24 hours:
>>>> > 1. Using the Capacity Scheduler
>>>> > 2. Implementing RabbitMQ and receiving data from it using Spring
>>>> > Integration data pipelines.
>>>> >
>>>> > I cannot afford to lose any of the JSON files received.
>>>> >
>>>> > Thanking you,
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Ouch Whisper
>>>> > 010101010101
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>
>>>
>>
>
>
>
> --
> Bertrand Dechoux



--
OZAWA Tsuyoshi