You may find Summingbird relevant; I'm still investigating it: https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
On Tue, Jan 7, 2014 at 11:39 AM, Alan Gates <ga...@hortonworks.com> wrote:

> I am not wise enough in the ways of Storm to tell you how you should
> partition data across bolts. However, there is no need in Hive for all
> the data for a partition to be in the same file, only in the same
> directory. So if each bolt creates a file for each partition, and all
> those files are placed in one directory and loaded into Hive, it will
> work.
>
> Alan.
>
> On Jan 6, 2014, at 6:26 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>
> > Alan,
> > The problem is that the data is partitioned hourly by epoch time, and I
> > want all data belonging to a partition to be written into one file named
> > after that partition. How can I share the file writer across different
> > bolts? Should I direct data within the same partition to the same bolt?
> > Thanks,
> > Chen
> >
> > On Fri, Jan 3, 2014 at 3:27 PM, Alan Gates <ga...@hortonworks.com> wrote:
> > You shouldn't need to write each record to a separate file. Each Storm
> > bolt should be able to write to its own file, appending records as it
> > goes. As long as you have only one writer per file, this should be fine.
> > You can then close the files every 15 minutes (or whatever works for
> > you) and have a separate job that creates a new partition in your Hive
> > table from the files created by your bolts.
> >
> > Alan.
> >
> > On Jan 2, 2014, at 11:58 AM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> >
> >> Guys,
> >> I am using Storm to read a data stream from our socket server, entry
> >> by entry, and then write the entries to files: one entry per file. At
> >> some point, I need to import the data into my Hive table. There are
> >> several approaches I can think of:
> >> 1. Directly write to the Hive HDFS file whenever I get an entry from
> >> our socket server. The problem is that this could be very inefficient,
> >> since we have a huge amount of streaming data, and I would not want to
> >> write to Hive HDFS one entry at a time.
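The routing question above (one writer per partition file) can be sketched roughly as follows. This is a hedged illustration, not Storm code: in a real topology the bolts would be Java and you would use a fields grouping on the partition key so all tuples for one partition land on the same bolt instance. The function names, the hourly partitioning, and the bolt count are assumptions for the sketch:

```python
import hashlib

NUM_BOLTS = 4  # hypothetical parallelism of the writer bolts

def partition_key(epoch_seconds):
    """Partition name for an entry, assuming hourly partitions
    derived from the entry's epoch timestamp."""
    return "hour=%d" % (epoch_seconds // 3600 * 3600)

def bolt_for(key, num_bolts=NUM_BOLTS):
    """Deterministically map a partition key to one bolt, the way a
    fields grouping would: same key -> same bolt -> same file writer."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bolts
```

Because the mapping is deterministic, two entries with timestamps in the same hour always reach the same bolt, so that bolt can own the single open file for that partition and simply append to it.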
> >> Or:
> >> 2. Write the entries to files (normal files or HDFS files) on disk,
> >> then have a separate job that merges those small files into big ones
> >> and loads them into the Hive table.
> >> The problems with this are: a) how can I merge small files into big
> >> files for Hive? b) What is the best file size to upload to Hive?
> >>
> >> I am seeking advice on both approaches, and appreciate your insight.
> >> Thanks,
> >> Chen
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.