Re: Experience with indexing billions of documents?

2010-04-14 Thread Jason Rutherglen
Tom,

Yes, we've (Biz360) indexed 3 billion documents and upwards...
If indexing (or rather re-indexing) is the issue, we used
SOLR-1301 with Hadoop to re-index efficiently (i.e., in a
timely manner). For querying we're currently using the
out-of-the-box Solr distributed shards query mechanism, which
is hard (read: near impossible) to customize. I've been writing
SOLR-1724, which deploys cores out of HDFS. SOLR-1724 works in
conjunction with Solr Cloud, which should allow for more
efficient failover. Katta has a nice model for replicating
cores across multiple servers for redundancy. The caveat is
that it could feasibly require twice as many servers for 2x
replication.
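
For illustration, here's a rough SolrJ sketch of what that stock
distributed query looks like (hostnames, the field name and the
core URLs are made up; this isn't our actual code):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DistributedQuery {
        public static void main(String[] args) throws Exception {
            // Any core can act as the aggregator for a distributed request.
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://host0:8983/solr");

            SolrQuery q = new SolrQuery("title:lucene");
            // Comma-separated list of shard host/core URLs (no http:// prefix).
            q.set("shards", "host0:8983/solr,host1:8983/solr,host2:8983/solr");
            q.setRows(10);

            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }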

If you have more questions, feel free to ping me.

Cheers,

Jason

On Fri, Apr 2, 2010 at 8:57 AM, Burton-West, Tom  wrote:
> We are currently indexing 5 million books in Solr, scaling up over the next 
> few years to 20 million.  However we are using the entire book as a Solr 
> document.  We are evaluating the possibility of indexing individual pages as 
> there are some use cases where users want the most relevant pages regardless 
> of what book they occur in.  However, we estimate that we are talking about 
> somewhere between 1 and 6 billion pages and have concerns over whether Solr 
> will scale to this level.
>
> Does anyone have experience using Solr with 1-6 billion Solr documents?
>
> The Lucene file format documentation
> (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions
> a limit of about 2 billion document ids.  I assume this is the Lucene
> internal document id and would therefore be a per-index/per-shard limit.  Is
> this correct?
>
>
> Tom Burton-West.
>
>
>
>


Re: Experience with indexing billions of documents?

2010-04-13 Thread Thomas Koch
Bradford Stephens:
> Hey there,
> 
> We've actually been tackling this problem at Drawn to Scale. We'd really
> like to get our hands on LuceHBase to see how it scales. Our faceting still
> needs to be done in-memory, which is kinda tricky, but it's worth
> exploring.
Hi Bradford,

Thank you for your interest. Just yesterday I found out that somebody else has
apparently done exactly the same thing I did, porting Lucandra to HBase:

http://github.com/akkumar/hbasene

I'll have a look at that project and will most likely abandon luceHBase in
favor of it, since it's more advanced.

Best regards,

Thomas Koch, http://www.koch.ro


Re: Experience with indexing billions of documents?

2010-04-13 Thread Bradford Stephens
Hey there,

We've actually been tackling this problem at Drawn to Scale. We'd really
like to get our hands on LuceHBase to see how it scales. Our faceting still
needs to be done in-memory, which is kinda tricky, but it's worth
exploring.

On Mon, Apr 12, 2010 at 7:27 AM, Thomas Koch  wrote:

> Hi,
>
> could I interest you in this project?
> http://github.com/thkoch2001/lucehbase
>
> The aim is to store the index directly in HBase, a database system modelled
> after Google's Bigtable and designed to store data in the terabyte to
> petabyte range.
>
> Best regards, Thomas Koch
>
> Lance Norskog:
> > The 2B limitation is within one shard, due to using a signed 32-bit
> > integer. There is no limit in that regard in sharding- Distributed
> > Search uses the stored unique document id rather than the internal
> > docid.
> >
> > On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens  wrote:
> > > A colleague of mine is using native Lucene + some home-grown
> > > patches/optimizations to index over 13B small documents in a 32-shard
> > > environment, which is around 406M docs per shard.
> > >
> > > If there's a 2B doc id limitation in Lucene then I assume he's patched it
> > > himself.
> > >
> > > On Fri, Apr 2, 2010 at 1:17 PM,  wrote:
> > >> My guess is that you will need to take advantage of Solr 1.5's upcoming
> > >> cloud/cluster renovations and use multiple indexes to comfortably
> > >> achieve those numbers. Hypothetically, in that case, you won't be
> > >> limited by single index docid limitations of Lucene.
> > >>
> > >> > We are currently indexing 5 million books in Solr, scaling up over the
> > >> > next few years to 20 million.  However we are using the entire book as
> > >> > a Solr document.  We are evaluating the possibility of indexing
> > >> > individual pages as there are some use cases where users want the most
> > >> > relevant pages regardless of what book they occur in.  However, we
> > >> > estimate that we are talking about somewhere between 1 and 6 billion
> > >> > pages and have concerns over whether Solr will scale to this level.
> > >> >
> > >> > Does anyone have experience using Solr with 1-6 billion Solr
> > >> > documents?
> > >> >
> > >> > The Lucene file format documentation
> > >> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> > >> > mentions a limit of about 2 billion document ids.  I assume this is
> > >> > the Lucene internal document id and would therefore be a
> > >> > per-index/per-shard limit.  Is this correct?
> > >> >
> > >> >
> > >> > Tom Burton-West.
> >
>
> Thomas Koch, http://www.koch.ro
>



-- 
Bradford Stephens,
Founder, Drawn to Scale
drawntoscalehq.com
727.697.7528

http://www.drawntoscalehq.com --  The intuitive, cloud-scale data solution.
Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science


Re: Experience with indexing billions of documents?

2010-04-12 Thread Thomas Koch
Hi,

could I interest you in this project?
http://github.com/thkoch2001/lucehbase

The aim is to store the index directly in HBase, a database system modelled
after Google's Bigtable and designed to store data in the terabyte to petabyte
range.
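
For illustration only, here's a very rough sketch of the general idea (this is
not the actual lucehbase schema; the table, column family and values are made
up): one HBase row per field/term, with posting entries stored as columns
keyed by document id.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TermRowWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table holding one row per field/term.
            HTable table = new HTable(conf, "term_index");

            // Row key: field + term; column qualifier: doc id; value: term freq.
            Put put = new Put(Bytes.toBytes("body:lucene"));
            put.add(Bytes.toBytes("postings"), Bytes.toBytes(42L), Bytes.toBytes(3));
            table.put(put);
            table.close();
        }
    }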

Best regards, Thomas Koch

Lance Norskog:
> The 2B limitation is within one shard, due to using a signed 32-bit
> integer. There is no limit in that regard in sharding- Distributed
> Search uses the stored unique document id rather than the internal
> docid.
> 
> On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens  wrote:
> > A colleague of mine is using native Lucene + some home-grown
> > patches/optimizations to index over 13B small documents in a 32-shard
> > environment, which is around 406M docs per shard.
> >
> > If there's a 2B doc id limitation in Lucene then I assume he's patched it
> > himself.
> >
> > On Fri, Apr 2, 2010 at 1:17 PM,  wrote:
> >> My guess is that you will need to take advantage of Solr 1.5's upcoming
> >> cloud/cluster renovations and use multiple indexes to comfortably
> >> achieve those numbers. Hypothetically, in that case, you won't be limited
> >> by single index docid limitations of Lucene.
> >>
> >> > We are currently indexing 5 million books in Solr, scaling up over the
> >> > next few years to 20 million.  However we are using the entire book as
> >> > a Solr document.  We are evaluating the possibility of indexing
> >> > individual pages as there are some use cases where users want the most
> >> > relevant pages regardless of what book they occur in.  However, we
> >> > estimate that we are talking about somewhere between 1 and 6 billion
> >> > pages and have concerns over whether Solr will scale to this level.
> >> >
> >> > Does anyone have experience using Solr with 1-6 billion Solr
> >> > documents?
> >> >
> >> > The Lucene file format documentation
> >> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> >> > mentions a limit of about 2 billion document ids.  I assume this is
> >> > the Lucene internal document id and would therefore be a
> >> > per-index/per-shard limit.  Is this correct?
> >> >
> >> >
> >> > Tom Burton-West.
> 

Thomas Koch, http://www.koch.ro


Re: Experience with indexing billions of documents?

2010-04-05 Thread Lance Norskog
The 2B limitation is within one shard, because Lucene uses a signed 32-bit
integer for its internal docids. There is no such limit across shards:
Distributed Search uses the stored unique document id rather than the
internal docid.
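
Back-of-the-envelope, using the numbers already in this thread:

    max internal docids per shard = 2^31 - 1 ~= 2.1 billion
    6 billion pages / 2.1 billion per shard ~= 3 shards minimum, just to fit
    6 billion pages / ~400M docs per shard  ~= 15 shards at the density
        Rich describes below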

On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens  wrote:
> A colleague of mine is using native Lucene + some home-grown
> patches/optimizations to index over 13B small documents in a 32-shard
> environment, which is around 406M docs per shard.
>
> If there's a 2B doc id limitation in Lucene then I assume he's patched it
> himself.
>
> On Fri, Apr 2, 2010 at 1:17 PM,  wrote:
>
>> My guess is that you will need to take advantage of Solr 1.5's upcoming
>> cloud/cluster renovations and use multiple indexes to comfortably achieve
>> those numbers. Hypothetically, in that case, you won't be limited by single
>> index docid limitations of Lucene.
>>
>> > We are currently indexing 5 million books in Solr, scaling up over the
>> > next few years to 20 million.  However we are using the entire book as a
>> > Solr document.  We are evaluating the possibility of indexing individual
>> > pages as there are some use cases where users want the most relevant
>> > pages regardless of what book they occur in.  However, we estimate that we are
>> > talking about somewhere between 1 and 6 billion pages and have concerns
>> > over whether Solr will scale to this level.
>> >
>> > Does anyone have experience using Solr with 1-6 billion Solr documents?
>> >
>> > The Lucene file format documentation
>> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
>> > mentions a limit of about 2 billion document ids.  I assume this is the
>> > Lucene internal document id and would therefore be a per-index/per-shard
>> > limit.  Is this correct?
>> >
>> >
>> > Tom Burton-West.
>> >
>> >
>> >
>> >
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Experience with indexing billions of documents?

2010-04-02 Thread Rich Cariens
A colleague of mine is using native Lucene + some home-grown
patches/optimizations to index over 13B small documents in a 32-shard
environment, which is around 406M docs per shard.

If there's a 2B doc id limitation in Lucene then I assume he's patched it
himself.

On Fri, Apr 2, 2010 at 1:17 PM,  wrote:

> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by single
> index docid limitations of Lucene.
>
> > We are currently indexing 5 million books in Solr, scaling up over the
> > next few years to 20 million.  However we are using the entire book as a
> > Solr document.  We are evaluating the possibility of indexing individual
> > pages as there are some use cases where users want the most relevant
> > pages regardless of what book they occur in.  However, we estimate that we are
> > talking about somewhere between 1 and 6 billion pages and have concerns
> > over whether Solr will scale to this level.
> >
> > Does anyone have experience using Solr with 1-6 billion Solr documents?
> >
> > The Lucene file format documentation
> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> > mentions a limit of about 2 billion document ids.  I assume this is the
> > Lucene internal document id and would therefore be a per-index/per-shard
> > limit.  Is this correct?
> >
> >
> > Tom Burton-West.
> >
> >
> >
> >
>
>


Re: Experience with indexing billions of documents?

2010-04-02 Thread Peter Sturge
You can do this today with multiple indexes, replication and distributed
searching. SolrCloud/clustering will certainly make life easier when it comes
to managing these, but with distributed searches over multiple indexes, you're
limited only by how much hardware you can throw at it.
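
As a rough illustration (a hypothetical SolrJ sketch, not any particular
production setup): the indexing client decides which shard owns each document,
e.g. by hashing the unique key, and the query side then fans out over the same
shard list.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ShardRouter {
        private final SolrServer[] shards;

        public ShardRouter(String[] shardUrls) throws Exception {
            shards = new SolrServer[shardUrls.length];
            for (int i = 0; i < shardUrls.length; i++) {
                shards[i] = new CommonsHttpSolrServer(shardUrls[i]);
            }
        }

        // A stable hash of the unique key picks the shard for this document,
        // so re-indexing the same id always lands on the same shard.
        public void index(SolrInputDocument doc, String uniqueId) throws Exception {
            int shard = (uniqueId.hashCode() & Integer.MAX_VALUE) % shards.length;
            shards[shard].add(doc);
        }
    }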


On Fri, Apr 2, 2010 at 6:17 PM,  wrote:

> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by single
> index docid limitations of Lucene.
>
> > We are currently indexing 5 million books in Solr, scaling up over the
> > next few years to 20 million.  However we are using the entire book as a
> > Solr document.  We are evaluating the possibility of indexing individual
> > pages as there are some use cases where users want the most relevant
> > pages regardless of what book they occur in.  However, we estimate that we are
> > talking about somewhere between 1 and 6 billion pages and have concerns
> > over whether Solr will scale to this level.
> >
> > Does anyone have experience using Solr with 1-6 billion Solr documents?
> >
> > The Lucene file format documentation
> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> > mentions a limit of about 2 billion document ids.  I assume this is the
> > Lucene internal document id and would therefore be a per-index/per-shard
> > limit.  Is this correct?
> >
> >
> > Tom Burton-West.
> >
> >
> >
> >
>
>


Re: Experience with indexing billions of documents?

2010-04-02 Thread darren
My guess is that you will need to take advantage of Solr 1.5's upcoming
cloud/cluster renovations and use multiple indexes to comfortably achieve
those numbers. Hypothetically, in that case, you won't be limited by single
index docid limitations of Lucene.

> We are currently indexing 5 million books in Solr, scaling up over the
> next few years to 20 million.  However we are using the entire book as a
> Solr document.  We are evaluating the possibility of indexing individual
> pages as there are some use cases where users want the most relevant pages
> regardless of what book they occur in.  However, we estimate that we are
> talking about somewhere between 1 and 6 billion pages and have concerns
> over whether Solr will scale to this level.
>
> Does anyone have experience using Solr with 1-6 billion Solr documents?
>
> The Lucene file format documentation
> (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> mentions a limit of about 2 billion document ids.  I assume this is the
> Lucene internal document id and would therefore be a per-index/per-shard
> limit.  Is this correct?
>
>
> Tom Burton-West.
>
>
>
>



Experience with indexing billions of documents?

2010-04-02 Thread Burton-West, Tom
We are currently indexing 5 million books in Solr, scaling up over the next few 
years to 20 million.  However we are using the entire book as a Solr document.  
We are evaluating the possibility of indexing individual pages as there are 
some use cases where users want the most relevant pages regardless of what book 
they occur in.  However, we estimate that we are talking about somewhere 
between 1 and 6 billion pages and have concerns over whether Solr will scale to 
this level.

Does anyone have experience using Solr with 1-6 billion Solr documents?

The Lucene file format documentation
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a
limit of about 2 billion document ids.  I assume this is the Lucene internal
document id and would therefore be a per-index/per-shard limit.  Is this
correct?


Tom Burton-West.