One idea that I have been toying with is removing the metadata about Blur tables from ZooKeeper and storing it in HDFS directly. Tens of thousands of tables might be difficult to support the way everything is currently implemented. You could run several shard clusters and have your tables evenly spread across them; the controllers can make all of the clusters appear as one, however this has never been tried at that kind of scale. I agree with Garrett's suggestions about grouping indexes together.
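For what it's worth, the single-table route basically comes down to AND-ing a required customer_id term onto every query. Since Blur queries are Lucene queries under the hood, a rough sketch at the plain Lucene level would look something like the following (this is not Blur's client API, and the field names customer_id and body are just placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class CustomerScopedQuery {
        public static void main(String[] args) {
            // Scope the search to a single customer by requiring the
            // customer_id term alongside whatever the user actually asked for.
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("customer_id", "123")), Occur.MUST);
            query.add(new TermQuery(new Term("body", "holiday")), Occur.MUST);
            System.out.println(query); // prints: +customer_id:123 +body:holiday
        }
    }

The per-customer-table route trades that required filter for a lot more tables, which is where the ZooKeeper pressure comes in.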
Aaron


On Fri, Dec 21, 2012 at 11:12 AM, Garrett Barton <[email protected]> wrote:

> Tens of thousands eh? I've had ~100-150 running and that worked fine. I
> could see issues with Blurs table tracking since its zookeeper backed, and
> zk doesn't like massive directories like that. Then again Blur has a
> caching system built into it for its meta data, so maybe it would be ok?
>
> Are the table structures going to be different? Is there any reasonable
> grouping you could do of the customers? Perhaps the small ones could live
> together in a larger index?
>
>
> On Fri, Dec 21, 2012 at 11:08 AM, Aaron McCurry <[email protected]> wrote:
>
> > I agree with Garret. We run ~100 tables with the shard count varying from
> > 1 shard to over 1000 in a single table. How many tables will you have?
> >
> > Yes Blur works on CDH3U2. It should work on any 0.20.x (1.0.x) version of
> > Hadoop. However if HDFS doesn't support appends then the write ahead log
> > won't function correctly. Meaning it won't actually preserve the data.
> >
> > Aaron
> >
> >
> > On Fri, Dec 21, 2012 at 10:59 AM, Garrett Barton <[email protected]> wrote:
> >
> > > If I understand you correctly you have data from multiple customers
> > > (denoted by a customer_id) and you only perform a search against a single
> > > customer at a time? If that's the case the separate index route might be a
> > > good idea as you can rebuild them separately, and you can model them
> > > differently potentially if you have a need. Having said that, if you also
> > > occasionally want to search across customers, then you would want them all
> > > in a single index.
> > >
> > > I have Blur 1.x running on CDH3U5, I think it will work back down to CDH3U2
> > > at least, and that's hadoop 0.20 in both cases. Have not tried 0.23 though
> > > I will be needing to soon.
> > >
> > >
> > > On Fri, Dec 21, 2012 at 10:51 AM, James Kebinger <[email protected]> wrote:
> > >
> > > > Hello, I'm hoping to kick the tires on apache blur in the near future. I
> > > > have a couple of quick questions before I set out.
> > > >
> > > > What version(s) of hadoop are required/supported at present?
> > > >
> > > > We have lots of data to index, but we always search within a particular
> > > > customer's data set. Would the best practice be to put all of the data in
> > > > one table and have the customer id in all of the queries, or build separate
> > > > tables for each customer_id (like users-1, users-123 etc).
> > > >
> > > > Thanks, and happy holidays!
> > > >
> > > > -James Kebinger
