@stack Thank you very much. Your idea is very good. Yes, monotonically increasing rowkeys are well suited to bulk-load tools that create new regions from the bottom up. I will study HBASE-48 and run these experiments.
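A minimal sketch of the row-key scheme the quoted strategy below rests on: a monotonically increasing stamp as the key prefix, plus a disambiguating suffix so events sharing a millisecond don't overwrite each other. The 4-byte random suffix is an assumption -- any unique qualifier would do.

import java.util.Random;
import org.apache.hadoop.hbase.util.Bytes;

public class EventKeys {
  private static final Random RNG = new Random();

  // Bytes.toBytes(long) is big-endian, so rows sort by time for any
  // non-negative stamp; the random tail only breaks ties.
  public static byte[] rowKey(long stampMillis) {
    byte[] suffix = new byte[4];
    RNG.nextBytes(suffix);
    return Bytes.add(Bytes.toBytes(stampMillis), suffix);
  }
}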
This idea depends on strictly monotonically increasing rowkeys, is that right?

Schubert

On Sat, Sep 5, 2009 at 1:16 AM, Andrew Purtell <[email protected]> wrote:

> Clean this up a bit and smooth the rough edges, and this is, I think, a valid strategy for handling a continuous high-volume import of sequential data with monotonically increasing row keys generated by some global counter -- NTP-synchronized System.currentTimeMillis(), for example. To support this, especially when the functions of META are relocated to ZK, we should add the necessary APIs under HBaseAdmin.
>
>> If by chance they have edits that don't fit the table's current time range, they'll fail to go in because the first and last regions are read-only.
>
> I'm not sure I understand that phrasing. If the script modifies the left-most and right-most regions (in terms of the key space) to have the empty row as start and end key respectively, won't any edits to the left or right of the table's current time range go in?
>
> - Andy
>
> ________________________________
> From: stack <[email protected]>
> To: [email protected]
> Sent: Friday, September 4, 2009 9:50:33 AM
> Subject: Time-series data problem (WAS -> Cassandra vs HBase)
>
> Ok. Thanks for answering the questions.
>
> Could you do something like the below?
>
> + One table keyed by timestamp, with an additional qualifier of some sort -- random if you have to -- so each event is a row of its own, or a column of its own inside a row keyed by ts.
> + In such a table, new data would always be added at the end of the table; old data would be removed from the top of the table.
> + Catch the incoming events in HDFS. They will be unordered. Catch them into sequence files (would 'scribe' help here?).
> + Study HBASE-48. See how it works. It's an MR job that writes regions, each of which holds a single hfile of close to the maximum region size.
> + Run an MR job every 10 minutes that takes all events that have arrived since the last MR job, sorts them, and passes them through the HBASE-48 sorting reducer and in turn out to the hfileoutputformat writer writing regions.
> + When the MR job completes, trigger a version of the script from HBASE-48 for adding the new regions. It will move them into place and add the regions made by the last MR job to .META.
> + On the next run of the meta scanner -- it runs every minute -- the new regions will show in the table.
> + The first and last regions in the table would be managed by an external script. Aging out regions would just be a matter of deleting mention of them from .META. and removing their content from HDFS. The same script would be responsible for updating the first region's entry in .META., moving its end row to encompass the start row of the new 'first' row in the table. Some similar juggling would have to be done around the 'last' entry in .META. -- every time you added regions, the last .META. entry would need adjusting. These regions would need to be read-only (you can set a flag on them from the script) so they don't take on any writes.
>
> The MR job writing hfiles directly should be a good bit faster than HBase taking the edits directly.
>
> If there is a hiccup in the system, the spill to HDFS will ensure you don't lose events. The next MR job that runs will just take a while.
>
> The batch updates that arrive by other channels, which you mention in another mail, should go in fine... If by chance they have edits that don't fit the table's current time range, they'll fail to go in because the first and last regions are read-only.
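A rough sketch of a driver for the 10-minute MR job described above. It leans on the HFileOutputFormat.configureIncrementalLoad() helper that later grew out of HBASE-48, so class and method names are assumptions against the patch as it stood in this thread; the "events" table name and the SequenceFile<LongWritable, Text> spill format are made up as well.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TenMinuteImport {
  // Assumes the spill files are SequenceFile<LongWritable stamp, Text event>.
  static class EventMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable stamp, Text event, Context ctx)
        throws IOException, InterruptedException {
      byte[] row = Bytes.toBytes(stamp.get());
      ctx.write(new ImmutableBytesWritable(row),
          new KeyValue(row, Bytes.toBytes("d"), Bytes.toBytes("raw"),
              Bytes.toBytes(event.toString())));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "ts-import");
    job.setJarByClass(TenMinuteImport.class);

    // Events spilled to HDFS since the last run.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(EventMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    // Wires in a total-order partitioner over the table's region boundaries
    // and a KeyValue-sorting reducer, so each reducer writes region-shaped
    // hfiles ready to be moved into place by the script.
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "events"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}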
> You should run with bigger regions, as the lads suggest elsewhere (and yeah, 10-20 machines seems low for the size of the data -- you'll probably have to start with a narrower time range than 3 months).
>
> Sorry if the above is a bit of a Rube Goldberg machine. Just a suggestion.
>
> St.Ack
>
> On Thu, Sep 3, 2009 at 11:44 PM, Schubert Zhang <[email protected]> wrote:
>
>> @stack
>>
>> On Fri, Sep 4, 2009 at 7:34 AM, stack <[email protected]> wrote:
>>
>>> More questions:
>>>
>>> + What are the requirements regarding the lag between receipt of data and it showing in the system? One minute, ten minutes, an hour, or 24 hours?
>>
>> As soon as possible; a delay of ten or more minutes is acceptable.
>>
>>> + How many column families?
>>
>> One is OK if using HBase, but I am still selecting the best solution now.
>>
>>> St.Ack
>>>
>>> On Wed, Sep 2, 2009 at 11:37 PM, Schubert Zhang <[email protected]> wrote:
>>>
>>>> Inline.
>>>>
>>>> On Thu, Sep 3, 2009 at 12:46 PM, stack <[email protected]> wrote:
>>>>
>>>>> How many machines can you use for this job?
>>>>
>>>> Tens (e.g. 10-20).
>>>>
>>>>> Do you need to keep it all? Does some data expire (or can it be moved offline)?
>>>>
>>>> Yes, we need to remove old data when it expires.
>>>>
>>>>> I see why you have the timestamp as part of the key in your current HBase cluster -- i.e. tall tables -- as you have no other choice currently.
>>>>>
>>>>> It might make sense to premake the regions in the table. Look at how many regions were made the day before and go ahead and premake that many, to save yourself having to ride over splits (I can show you how to write a little script to do this).
>>>>>
>>>>> Does the time-series data arrive roughly on time -- e.g. do all instruments emit the 4 o'clock readings at 4 o'clock, or is there some flux here? In other words, do you have a write rate of thousands of updates per second, all carrying the same timestamp?
>>>>
>>>> The data will arrive with a delay of minutes. Usually we need to write/ingest tens of thousands of new rows, many of them with the same timestamp.
>>>>
>>>>> St.Ack
>>>>
>>>> Schubert
>>>>
>>>>> On Thu, Sep 3, 2009 at 2:32 AM, Jonathan Gray <[email protected]> wrote:
>>>>>
>>>>>> @Sylvain
>>>>>>
>>>>>> If you describe your use case, perhaps we can help you understand what others are doing / have done similarly. Event logging is certainly something many of us have done.
>>>>>>
>>>>>> If you're wondering about how much load HBase can handle, provide some numbers of what you expect. How much data in bytes is associated with each event, how many events arrive per hour, and what operations do you want to do on it? We could help you determine how big a cluster you might need and the kind of write/read throughput you might see.
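On the region premaking stack mentions above, a minimal sketch of what such a little script could do, assuming the createTable(descriptor, splitKeys) call that HBaseAdmin grew in later releases. The table name, column family, day boundary, and the one-region-per-ten-minutes split count are all invented for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PremakeRegions {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    HTableDescriptor desc = new HTableDescriptor("events-20090905");
    desc.addFamily(new HColumnDescriptor("d"));

    // One region per 10-minute slice of the day; in practice the slice
    // count would be copied from how many regions yesterday ended up with.
    long dayStart = 1252108800000L; // hypothetical day boundary, ms
    int slices = 144;
    byte[][] splits = new byte[slices - 1][];
    for (int i = 1; i < slices; i++) {
      splits[i - 1] = Bytes.toBytes(dayStart + i * 10L * 60 * 1000);
    }
    admin.createTable(desc, splits); // first/last regions keep empty start/end keys
  }
}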
>>>>>> @Schubert
>>>>>>
>>>>>> You do not need to partition your tables by stamp. One possibility is to put the stamp as the first part of your rowkeys, and that way you will have the table sorted by time. Using a Scan's start/stop keys, you can avoid doing a full table scan.
>>>>>
>>>>> It would not work, since our data arrives very fast. With that method only one region (server) is busy writing, so the write throughput is bad.
>>>>>
>>>>>> For both of you... If you are storing massive amounts of streaming log-type data, do you need full random read access to it? If you just need to process subsets of time, that's easily partitioned by file. HBase should be used if you need to *read* from it randomly, not stream it. If you have processing that can benefit from HBase's inherent sorting, grouping, and indexing, then it can also make sense to use HBase in order to avoid full scans of the data.
>>>>>
>>>>> I know there is a contradiction between random access and batch processing. But the features of HBase (sorting, distributed b-tree, merge/compaction) are very attractive.
>>>>>
>>>>>> HBase is not the answer because of the lack of HDFS append. You could buffer in something outside HDFS, close files after a certain size/time (this is what HBase does now; we can have data loss because of no appends as well), etc...
>>>>>>
>>>>>> Reads/writes of lots of streaming data to HBase will always be slower than HDFS. HBase adds additional buffering, and the compaction/split processes actually mean you copy the same data multiple times (probably 3-4 times on average, which lines up with the 3-4x slowdown you see).
>>>>>>
>>>>>> And there is currently a patch in development (that works at least partially) to do direct-to-HDFS imports into HBase, which would then be nearly as fast as a normal HDFS writing job.
>>>>>>
>>>>>> Issue here: https://issues.apache.org/jira/browse/HBASE-48
>>>>>>
>>>>>> JG
>>>>>>
>>>>>> Sylvain Hellegouarch wrote:
>>>>>>
>>>>>>> I must admit, I'm left as puzzled as you are. Our current use case at work involves a large amount of small event-log writing. Of course HDFS was quickly out of the question, since it cannot yet append to a file and, more generally, handle a large number of small write ops.
>>>>>>>
>>>>>>> So we decided on HBase because we trust that the Hadoop/HBase infrastructure will offer us the robustness and reliability we need. That being said, I'm not feeling at ease with regard to HBase's capacity to handle the potential load we are looking at inputting.
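For JG's suggestion of stamp-prefixed rowkeys with Scan start/stop keys, a minimal sketch against the 0.20-era client API; the table name and hour boundaries are made up. It illustrates the read side only -- as Schubert notes above, the same key scheme funnels all current writes into a single region.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HourScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events");

    long from = 1252108800000L;        // hypothetical hour start, ms
    long to = from + 60L * 60 * 1000;  // hour end (exclusive)

    // Because the stamp is the key prefix, rows sort by time and the
    // scanner only touches the regions covering [from, to).
    Scan scan = new Scan(Bytes.toBytes(from), Bytes.toBytes(to));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r ...
      }
    } finally {
      scanner.close();
    }
  }
}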
>>>>>>> In fact, it's a common trait of such systems: they've been designed with a certain use case in mind, and sometimes I feel like their design and implementation leak way too much into our infrastructure, leading us down the path of a virtual lock-in.
>>>>>>>
>>>>>>> Now I am not accusing anyone here, just observing that I find it really hard to locate any industrial story of these systems in a use case similar to the one we have at hand.
>>>>>>>
>>>>>>> The number of nodes this or that company has doesn't interest me as much as the way they are actually using HBase and Hadoop.
>>>>>>>
>>>>>>> RDBMSs don't scale as well, but they've got a long history, and people do know how to optimise, use and manage them. It seems column-oriented database systems are still young :)
>>>>>>>
>>>>>>> - Sylvain
>>>>>>>
>>>>>>> Schubert Zhang a écrit :
>>>>>>>
>>>>>>>> Regardless of Cassandra, I want to discuss some questions about HBase/Bigtable. Any advice is welcome.
>>>>>>>>
>>>>>>>> Regarding running MapReduce to scan/analyze big data in HBase:
>>>>>>>>
>>>>>>>> Compared to sequentially reading data from HDFS files directly, scanning/sequentially reading data from HBase is slower (in my tests, at least 3:1 or 4:1).
>>>>>>>>
>>>>>>>> For data in HBase, it is difficult to analyze only a specified part of it. For example, it is difficult to analyze only the most recent day of data. In my application, I am considering partitioning the data into different HBase tables (e.g. one day - one table); then I can touch only one table when analyzing via MapReduce. In Google's Bigtable paper, in section "8.1 Google Analytics", they also describe this usage, but I don't know how they do it.
>>>>>>>>
>>>>>>>> It is also slower to put flooding data into an HBase table than to write it to files (in my tests, at least 3:1 or 4:1 as well). So maybe in the future HBase can provide a bulk-load feature, like PNUTS?
>>>>>>>>
>>>>>>>> Many people suggest we store only metadata in HBase tables and leave the data in HDFS files, because our time-series dataset is very big. I understand this idea makes sense for some simple application requirements. But usually, I want different indexes into the raw data, and it is difficult to build such indexes if the raw data files (which are raw, or are reconstructed via MapReduce periodically on recent data) are not totally sorted. .... HBase gives us many of the features we expect: sorted, distributed b-tree, compact/merge.
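A minimal sketch of the analysis side of the per-day-table idea above: the MapReduce job simply opens that day's table, so it never touches the rest of the dataset. It uses TableMapReduceUtil from the 0.20 mapreduce package; the events-YYYYMMDD naming and the trivial counting mapper are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyAnalysis {
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("daily", "rows").increment(1); // stand-in for real analysis
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "daily-analysis");
    job.setJarByClass(DailyAnalysis.class);

    // Partitioning by day means the job only ever scans that day's table.
    String dayTable = "events-" + args[0]; // e.g. events-20090902 (made-up naming)
    TableMapReduceUtil.initTableMapperJob(dayTable, new Scan(),
        CountMapper.class, NullWritable.class, NullWritable.class, job);

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}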
>>>>>>>> So it is very difficult for me to make the trade-off. If I store data in HDFS files (maybe partitioned) and metadata/index in HBase, the metadata/index is very difficult to build. If I rely on HBase totally, the performance of ingesting and scanning data is not good. Is it reasonable to do MapReduce on HBase? We know the goal of HBase is to provide random access over HDFS; it is an extension of, or adaptor over, HDFS.
>>>>>>>>
>>>>>>>> ----
>>>>>>>> Many a time I think maybe we need a data storage engine which does not need such strong consistency and can provide better write and read throughput, like HDFS. Maybe we can design another system, like a simpler HBase?
>>>>>>>>
>>>>>>>> Schubert
>>>>>>>>
>>>>>>>> On Wed, Sep 2, 2009 at 8:56 AM, Andrew Purtell <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> To be precise, S3. http://status.aws.amazon.com/s3-20080720.html
>>>>>>>>>
>>>>>>>>> - Andy
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Andrew Purtell <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>> Sent: Tuesday, September 1, 2009 5:53:09 PM
>>>>>>>>> Subject: Re: Cassandra vs HBase
>>>>>>>>>
>>>>>>>>> Right... I recall an incident in AWS where a malformed gossip packet took down all of Dynamo. It seems that even P2P doesn't mitigate corner cases.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 1, 2009 at 3:12 PM, Jonathan Ellis <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> The big win for Cassandra is that its p2p distribution model -- which drives the consistency model -- means there is no single point of failure. A SPOF can be mitigated by failover, but it's really, really hard to get all the corner cases right with that approach. Even Google, with their 3-year head start and huge engineering resources, still has trouble with that occasionally. (See e.g. http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179.)
