Re: Getting started - sharding data by customer, and hadoop version requirements.

James Kebinger Fri, 21 Dec 2012 10:23:02 -0800

Grouping some smaller customers together may be workable.
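
A minimal sketch of that grouping, assuming a per-customer document count is
available. The function name, the size threshold, the number of shared
tables, and the "users-shared-NN" naming are all hypothetical and not part of
Blur:

```python
import zlib


def table_for_customer(customer_id: int, doc_count: int,
                       large_threshold: int = 1_000_000,
                       shared_tables: int = 64) -> str:
    """Pick a table name for a customer (all defaults are illustrative).

    Large customers get a dedicated table; small ones are spread over a
    fixed number of shared tables using a stable hash of the customer id.
    """
    if doc_count >= large_threshold:
        # Dedicated table, following the users-<id> naming from the thread.
        return f"users-{customer_id}"
    # crc32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(str(customer_id).encode("ascii")) % shared_tables
    return f"users-shared-{bucket:02d}"
```

Queries against a shared table would still need to filter on customer_id,
and a customer that grows past the threshold would have to be reindexed into
its own table.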

Could we hack around ZooKeeper's limitations the same way file systems do,
with a tree of /p/r/e/f/prefixes so that no znode has too many children?
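
A rough sketch of what such a layout could look like. It derives the prefix
levels from a hash of the table name rather than from the name itself, since
names like users-1 and users-123 share their leading characters and would
all land in the same branch. The root path "/blur/tables" and the function
name are hypothetical; this is not how Blur lays out its znodes today:

```python
import hashlib


def prefixed_znode_path(table_name: str, depth: int = 2,
                        root: str = "/blur/tables") -> str:
    """Nest table_name under `depth` single-character hash prefixes.

    Each prefix level is one hex digit, so an intermediate znode has at
    most 16 children and the tables are spread over 16**depth leaves.
    """
    digest = hashlib.md5(table_name.encode("utf-8")).hexdigest()
    prefixes = list(digest[:depth])  # e.g. two hex digits for depth=2
    return "/".join([root, *prefixes, table_name])
```

With depth=2 there are 256 leaf znodes, so 50,000 tables would average
roughly 195 children per leaf. Listing all tables then requires walking the
tree instead of reading one znode's children.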



On Fri, Dec 21, 2012 at 11:12 AM, Garrett Barton
<[email protected]> wrote:

> Tens of thousands, eh?  I've had ~100-150 running and that worked fine.  I
> could see issues with Blur's table tracking since it's ZooKeeper-backed, and
> ZooKeeper doesn't like massive directories like that.  Then again, Blur has
> a caching system built into it for its metadata, so maybe it would be OK?
>
> Are the table structures going to be different?  Is there any reasonable
> grouping you could do of the customers? Perhaps the small ones could live
> together in a larger index?
>
>
>
> On Fri, Dec 21, 2012 at 11:08 AM, Aaron McCurry <[email protected]>
> wrote:
>
> > I agree with Garrett.  We run ~100 tables, with the shard count varying
> > from 1 shard to over 1000 in a single table.  How many tables will you
> > have?
> >
> > Yes, Blur works on CDH3U2.  It should work on any 0.20.x (1.0.x) version
> > of Hadoop.  However, if HDFS doesn't support appends, the write-ahead log
> > won't function correctly, meaning it won't actually preserve the data.
> >
> > Aaron
> >
> >
> > On Fri, Dec 21, 2012 at 10:59 AM, Garrett Barton
> > <[email protected]> wrote:
> >
> > > If I understand you correctly, you have data from multiple customers
> > > (denoted by a customer_id) and you only perform a search against a
> > > single customer at a time?  If that's the case, the separate index
> > > route might be a good idea, as you can rebuild them separately and
> > > model them differently if you need to.  Having said that, if you also
> > > occasionally want to search across customers, then you would want
> > > them all in a single index.
> > >
> > > I have Blur 1.x running on CDH3U5.  I think it will work back down to
> > > CDH3U2 at least, and that's Hadoop 0.20 in both cases.  I have not
> > > tried 0.23, though I will need to soon.
> > >
> > >
> > > On Fri, Dec 21, 2012 at 10:51 AM, James Kebinger <[email protected]>
> > > wrote:
> > >
> > > > Hello, I'm hoping to kick the tires on Apache Blur in the near
> > > > future.  I have a couple of quick questions before I set out.
> > > >
> > > > What version(s) of Hadoop are required/supported at present?
> > > >
> > > > We have lots of data to index, but we always search within a
> > > > particular customer's data set.  Would the best practice be to put
> > > > all of the data in one table and have the customer id in all of the
> > > > queries, or to build separate tables for each customer_id (like
> > > > users-1, users-123, etc.)?
> > > >
> > > > Thanks, and happy holidays!
> > > >
> > > > -James Kebinger
> > > >
> > >
> >
>
