@stack Thank you very much. Your idea is very good. Yes, monotonically increasing rowkeys are well suited to bulk-load tools that create new regions from the bottom up. I will study HBASE-48 and run these experiments.
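A minimal sketch of the row-key scheme the quoted strategy below rests on: a monotonically increasing stamp as the key prefix, plus a disambiguating suffix so events sharing a millisecond don't overwrite each other. The 4-byte random suffix is an assumption -- any unique qualifier would do.

import java.util.Random;
import org.apache.hadoop.hbase.util.Bytes;

public class EventKeys {
  private static final Random RNG = new Random();

  // Bytes.toBytes(long) is big-endian, so rows sort by time for any
  // non-negative stamp; the random tail only breaks ties.
  public static byte[] rowKey(long stampMillis) {
    byte[] suffix = new byte[4];
    RNG.nextBytes(suffix);
    return Bytes.add(Bytes.toBytes(stampMillis), suffix);
  }
}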
This idea depends on strictly monotonically increasing rowkeys, is that right?

Schubert

On Sat, Sep 5, 2009 at 1:16 AM, Andrew Purtell <[email protected]> wrote:

> Clean this up a bit and smooth the rough edges, and this is, I think, a valid strategy for handling a continuous high-volume import of sequential data with monotonically increasing row keys generated by some global counter -- NTP-synchronized System.currentTimeMillis(), for example. To support this, especially when the functions of META are relocated to ZK, we should add the necessary APIs under HBaseAdmin.
>
>> If by chance they have edits that don't fit the table's current time range, they'll fail to go in because the first and last regions are read-only.
>
> I'm not sure I understand that phrasing. If the script modifies the left-most and right-most regions (in terms of the key space) to have the empty row as start and end key respectively, won't any edits to the left or right of the table's current time range go in?
>
> - Andy
>
> ________________________________
> From: stack <[email protected]>
> To: [email protected]
> Sent: Friday, September 4, 2009 9:50:33 AM
> Subject: Time-series data problem (WAS -> Cassandra vs HBase)
>
> Ok. Thanks for answering the questions.
>
> Could you do something like the below?
>
> + One table keyed by timestamp, with an additional qualifier of some sort -- random if you have to -- so each event is a row of its own, or a column of its own inside a row keyed by ts.
> + In such a table, new data would always be added at the end of the table; old data would be removed from the top of the table.
> + Catch the incoming events in HDFS. They will be unordered. Catch them into sequence files (would 'scribe' help here?).
> + Study HBASE-48. See how it works. It's an MR job that writes regions, each of which holds a single hfile of close to the maximum region size.
> + Run an MR job every 10 minutes that takes all events that have arrived since the last MR job, sorts them, and passes them through the HBASE-48 sorting reducer and in turn out to the hfileoutputformat writer writing regions.
> + When the MR job completes, trigger a version of the script from HBASE-48 for adding the new regions. It will move them into place and add the regions made by the last MR job to .META.
> + On the next run of the meta scanner -- it runs every minute -- the new regions will show in the table.
> + The first and last regions in the table would be managed by an external script. Aging out regions would just be a matter of deleting mention of them from .META. and removing their content from HDFS. The same script would be responsible for updating the first region's entry in .META., moving its end row to encompass the start row of the new 'first' row in the table. Some similar juggling would have to be done around the 'last' entry in .META. -- every time you added regions, the last .META. entry would need adjusting. These regions would need to be read-only (you can set a flag on them from the script) so they don't take on any writes.
>
> The MR job writing hfiles directly should be a good bit faster than HBase taking the edits directly.
>
> If there is a hiccup in the system, the spill to HDFS will ensure you don't lose events. The next MR job that runs will just take a while.
>
> The batch updates that arrive by other channels, which you mention in another mail, should go in fine... If by chance they have edits that don't fit the table's current time range, they'll fail to go in because the first and last regions are read-only.
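A rough sketch of a driver for the 10-minute MR job described above. It leans on the HFileOutputFormat.configureIncrementalLoad() helper that later grew out of HBASE-48, so class and method names are assumptions against the patch as it stood in this thread; the "events" table name and the SequenceFile<LongWritable, Text> spill format are made up as well.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TenMinuteImport {
  // Assumes the spill files are SequenceFile<LongWritable stamp, Text event>.
  static class EventMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable stamp, Text event, Context ctx)
        throws IOException, InterruptedException {
      byte[] row = Bytes.toBytes(stamp.get());
      ctx.write(new ImmutableBytesWritable(row),
          new KeyValue(row, Bytes.toBytes("d"), Bytes.toBytes("raw"),
              Bytes.toBytes(event.toString())));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "ts-import");
    job.setJarByClass(TenMinuteImport.class);

    // Events spilled to HDFS since the last run.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(EventMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    // Wires in a total-order partitioner over the table's region boundaries
    // and a KeyValue-sorting reducer, so each reducer writes region-shaped
    // hfiles ready to be moved into place by the script.
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "events"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}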
> You should run with bigger regions, as the lads suggest elsewhere (and yeah, 10-20 machines seems low for the size of the data -- you'll probably have to start with a narrower time range than 3 months).
>
> Sorry if the above is a bit of a Rube Goldberg machine. Just a suggestion.
>
> St.Ack
>
> On Thu, Sep 3, 2009 at 11:44 PM, Schubert Zhang <[email protected]> wrote:
>
>> @stack
>>
>> On Fri, Sep 4, 2009 at 7:34 AM, stack <[email protected]> wrote:
>>
>>> More questions:
>>>
>>> + What are the requirements regarding the lag between receipt of data and it showing in the system? One minute, ten minutes, an hour, or 24 hours?
>>
>> As soon as possible; a delay of ten or more minutes is acceptable.
>>
>>> + How many column families?
>>
>> One is OK if using HBase, but I am still selecting the best solution now.
>>
>>> St.Ack
>>>
>>> On Wed, Sep 2, 2009 at 11:37 PM, Schubert Zhang <[email protected]> wrote:
>>>
>>>> Inline.
>>>>
>>>> On Thu, Sep 3, 2009 at 12:46 PM, stack <[email protected]> wrote:
>>>>
>>>>> How many machines can you use for this job?
>>>>
>>>> Tens (e.g. 10-20).
>>>>
>>>>> Do you need to keep it all? Does some data expire (or can it be moved offline)?
>>>>
>>>> Yes, we need to remove old data when it expires.
>>>>
>>>>> I see why you have the timestamp as part of the key in your current HBase cluster -- i.e. tall tables -- as you have no other choice currently.
>>>>>
>>>>> It might make sense to premake the regions in the table. Look at how many regions were made the day before and go ahead and premake that many, to save yourself having to ride over splits (I can show you how to write a little script to do this).
>>>>>
>>>>> Does the time-series data arrive roughly on time -- e.g. do all instruments emit the 4 o'clock readings at 4 o'clock, or is there some flux here? In other words, do you have a write rate of thousands of updates per second, all carrying the same timestamp?
>>>>
>>>> The data will arrive with a delay of minutes. Usually we need to write/ingest tens of thousands of new rows, many of them with the same timestamp.
>>>>
>>>>> St.Ack
>>>>
>>>> Schubert
>>>>
>>>>> On Thu, Sep 3, 2009 at 2:32 AM, Jonathan Gray <[email protected]> wrote:
>>>>>
>>>>>> @Sylvain
>>>>>>
>>>>>> If you describe your use case, perhaps we can help you understand what others are doing / have done similarly. Event logging is certainly something many of us have done.
>>>>>>
>>>>>> If you're wondering about how much load HBase can handle, provide some numbers of what you expect. How much data in bytes is associated with each event, how many events arrive per hour, and what operations do you want to do on it? We could help you determine how big a cluster you might need and the kind of write/read throughput you might see.
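On the region premaking stack mentions above, a minimal sketch of what such a little script could do, assuming the createTable(descriptor, splitKeys) call that HBaseAdmin grew in later releases. The table name, column family, day boundary, and the one-region-per-ten-minutes split count are all invented for illustration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PremakeRegions {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    HTableDescriptor desc = new HTableDescriptor("events-20090905");
    desc.addFamily(new HColumnDescriptor("d"));

    // One region per 10-minute slice of the day; in practice the slice
    // count would be copied from how many regions yesterday ended up with.
    long dayStart = 1252108800000L; // hypothetical day boundary, ms
    int slices = 144;
    byte[][] splits = new byte[slices - 1][];
    for (int i = 1; i < slices; i++) {
      splits[i - 1] = Bytes.toBytes(dayStart + i * 10L * 60 * 1000);
    }
    admin.createTable(desc, splits); // first/last regions keep empty start/end keys
  }
}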
>>>>>> @Schubert
>>>>>>
>>>>>> You do not need to partition your tables by stamp. One possibility is to put the stamp as the first part of your rowkeys, and that way you will have the table sorted by time. Using a Scan's start/stop keys, you can avoid doing a full table scan.
>>>>>
>>>>> It would not work, since our data arrives very fast. With that method only one region (server) is busy writing, so the write throughput is bad.
>>>>>
>>>>>> For both of you... If you are storing massive amounts of streaming log-type data, do you need full random read access to it? If you just need to process subsets of time, that's easily partitioned by file. HBase should be used if you need to *read* from it randomly, not stream it. If you have processing that can benefit from HBase's inherent sorting, grouping, and indexing, then it can also make sense to use HBase in order to avoid full scans of the data.
>>>>>
>>>>> I know there is a contradiction between random access and batch processing. But the features of HBase (sorting, distributed b-tree, merge/compaction) are very attractive.
>>>>>
>>>>>> HBase is not the answer because of the lack of HDFS append. You could buffer in something outside HDFS, close files after a certain size/time (this is what HBase does now; we can have data loss because of no appends as well), etc...
>>>>>>
>>>>>> Reads/writes of lots of streaming data to HBase will always be slower than HDFS. HBase adds additional buffering, and the compaction/split processes actually mean you copy the same data multiple times (probably 3-4 times on average, which lines up with the 3-4x slowdown you see).
>>>>>>
>>>>>> And there is currently a patch in development (that works at least partially) to do direct-to-HDFS imports into HBase, which would then be nearly as fast as a normal HDFS writing job.
>>>>>>
>>>>>> Issue here: https://issues.apache.org/jira/browse/HBASE-48
>>>>>>
>>>>>> JG
>>>>>>
>>>>>> Sylvain Hellegouarch wrote:
>>>>>>
>>>>>>> I must admit, I'm left as puzzled as you are. Our current use case at work involves a large amount of small event-log writing. Of course HDFS was quickly out of the question, since it cannot yet append to a file and, more generally, handle a large number of small write ops.
>>>>>>>
>>>>>>> So we decided on HBase because we trust that the Hadoop/HBase infrastructure will offer us the robustness and reliability we need. That being said, I'm not feeling at ease with regard to HBase's capacity to handle the potential load we are looking at inputting.
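For JG's suggestion of stamp-prefixed rowkeys with Scan start/stop keys, a minimal sketch against the 0.20-era client API; the table name and hour boundaries are made up. It illustrates the read side only -- as Schubert notes above, the same key scheme funnels all current writes into a single region.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HourScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events");

    long from = 1252108800000L;        // hypothetical hour start, ms
    long to = from + 60L * 60 * 1000;  // hour end (exclusive)

    // Because the stamp is the key prefix, rows sort by time and the
    // scanner only touches the regions covering [from, to).
    Scan scan = new Scan(Bytes.toBytes(from), Bytes.toBytes(to));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r ...
      }
    } finally {
      scanner.close();
    }
  }
}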
>>>>>>> In fact, it's a common trait of such systems: they've been designed with a certain use case in mind, and sometimes I feel like their design and implementation leak way too much into our infrastructure, leading us down the path of a virtual lock-in.
>>>>>>>
>>>>>>> Now I am not accusing anyone here, just observing that I find it really hard to locate any industrial story of these systems in a use case similar to the one we have at hand.
>>>>>>>
>>>>>>> The number of nodes this or that company has doesn't interest me as much as the way they are actually using HBase and Hadoop.
>>>>>>>
>>>>>>> RDBMSs don't scale as well, but they've got a long history, and people do know how to optimise, use and manage them. It seems column-oriented database systems are still young :)
>>>>>>>
>>>>>>> - Sylvain
>>>>>>>
>>>>>>> Schubert Zhang a écrit :
>>>>>>>
>>>>>>>> Regardless of Cassandra, I want to discuss some questions about HBase/Bigtable. Any advice is welcome.
>>>>>>>>
>>>>>>>> Regarding running MapReduce to scan/analyze big data in HBase:
>>>>>>>>
>>>>>>>> Compared to sequentially reading data from HDFS files directly, scanning/sequentially reading data from HBase is slower (in my tests, at least 3:1 or 4:1).
>>>>>>>>
>>>>>>>> For data in HBase, it is difficult to analyze only a specified part of it. For example, it is difficult to analyze only the most recent day of data. In my application, I am considering partitioning the data into different HBase tables (e.g. one day - one table); then I can touch only one table when analyzing via MapReduce. In Google's Bigtable paper, in section "8.1 Google Analytics", they also describe this usage, but I don't know how they do it.
>>>>>>>>
>>>>>>>> It is also slower to put flooding data into an HBase table than to write it to files (in my tests, at least 3:1 or 4:1 as well). So maybe in the future HBase can provide a bulk-load feature, like PNUTS?
>>>>>>>>
>>>>>>>> Many people suggest we store only metadata in HBase tables and leave the data in HDFS files, because our time-series dataset is very big. I understand this idea makes sense for some simple application requirements. But usually, I want different indexes into the raw data, and it is difficult to build such indexes if the raw data files (which are raw, or are reconstructed via MapReduce periodically on recent data) are not totally sorted. .... HBase gives us many of the features we expect: sorted, distributed b-tree, compact/merge.
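A minimal sketch of the analysis side of the per-day-table idea above: the MapReduce job simply opens that day's table, so it never touches the rest of the dataset. It uses TableMapReduceUtil from the 0.20 mapreduce package; the events-YYYYMMDD naming and the trivial counting mapper are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyAnalysis {
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("daily", "rows").increment(1); // stand-in for real analysis
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "daily-analysis");
    job.setJarByClass(DailyAnalysis.class);

    // Partitioning by day means the job only ever scans that day's table.
    String dayTable = "events-" + args[0]; // e.g. events-20090902 (made-up naming)
    TableMapReduceUtil.initTableMapperJob(dayTable, new Scan(),
        CountMapper.class, NullWritable.class, NullWritable.class, job);

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}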
>>>>>>>> So it is very difficult for me to make the trade-off. If I store data in HDFS files (maybe partitioned) and metadata/index in HBase, the metadata/index is very difficult to build. If I rely on HBase totally, the performance of ingesting and scanning data is not good. Is it reasonable to do MapReduce on HBase? We know the goal of HBase is to provide random access over HDFS; it is an extension of, or adaptor over, HDFS.
>>>>>>>>
>>>>>>>> ----
>>>>>>>> Many a time I think maybe we need a data storage engine which does not need such strong consistency and can provide better write and read throughput, like HDFS. Maybe we can design another system, like a simpler HBase?
>>>>>>>>
>>>>>>>> Schubert
>>>>>>>>
>>>>>>>> On Wed, Sep 2, 2009 at 8:56 AM, Andrew Purtell <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> To be precise, S3. http://status.aws.amazon.com/s3-20080720.html
>>>>>>>>>
>>>>>>>>> - Andy
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Andrew Purtell <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>> Sent: Tuesday, September 1, 2009 5:53:09 PM
>>>>>>>>> Subject: Re: Cassandra vs HBase
>>>>>>>>>
>>>>>>>>> Right... I recall an incident in AWS where a malformed gossip packet took down all of Dynamo. It seems that even P2P doesn't mitigate corner cases.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 1, 2009 at 3:12 PM, Jonathan Ellis <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> The big win for Cassandra is that its p2p distribution model -- which drives the consistency model -- means there is no single point of failure. A SPOF can be mitigated by failover, but it's really, really hard to get all the corner cases right with that approach. Even Google, with their 3-year head start and huge engineering resources, still has trouble with that occasionally. (See e.g. http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179.)
