Re: Experience with indexing billions of documents?
Tom,

Yes, we've (Biz360) indexed 3 billion and upwards. If indexing (or rather re-indexing) is the issue, we used SOLR-1301 with Hadoop to re-index efficiently (i.e., in a timely manner). For querying we're currently using the out-of-the-box Solr distributed shards query mechanism, which is hard (read: near impossible) to customize. I've been writing SOLR-1724, which deploys cores out of HDFS. SOLR-1724 works in conjunction with SolrCloud, which should allow for more efficient failover.

Katta has a nice model for replicating cores across multiple servers for redundancy. The issue with this is that it could feasibly require twice as many servers for 2x replication.

If you have more questions, feel free to ping me.

Cheers, Jason

On Fri, Apr 2, 2010 at 8:57 AM, Burton-West, Tom wrote:
> We are currently indexing 5 million books in Solr, scaling up over the next
> few years to 20 million. However, we are using the entire book as a Solr
> document. We are evaluating the possibility of indexing individual pages, as
> there are some use cases where users want the most relevant pages regardless
> of what book they occur in. However, we estimate that we are talking about
> somewhere between 1 and 6 billion pages and have concerns over whether Solr
> will scale to this level.
>
> Does anyone have experience using Solr with 1-6 billion Solr documents?
>
> The Lucene file format document
> (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions
> a limit of about 2 billion document ids. I assume this is the Lucene
> internal document id and would therefore be a per-index/per-shard limit. Is
> this correct?
>
> Tom Burton-West.
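[Editor's note: the out-of-the-box distributed query mechanism Jason mentions fans a single request out over every core named in Solr's shards parameter and merges the results. A minimal client-side sketch of composing such a request; the hostnames, core names, and query field are hypothetical:]

```python
from urllib.parse import urlencode

# Hypothetical shard endpoints; Solr's stock distributed search queries
# every core listed in the "shards" parameter and merges the results.
shards = [
    "solr1:8983/solr/books0",
    "solr2:8983/solr/books1",
]

params = urlencode({
    "q": "page_text:whaling",   # hypothetical field and query term
    "shards": ",".join(shards),
    "rows": 10,
})

request_url = "http://solr1:8983/solr/select?" + params
print(request_url)
```

Result merging keys on the stored unique document id, which is why the per-shard internal docid ceiling does not constrain the total corpus size.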
Re: Experience with indexing billions of documents?
Bradford Stephens:
> Hey there,
>
> We've actually been tackling this problem at Drawn to Scale. We'd really
> like to get our hands on LuceHBase to see how it scales. Our faceting still
> needs to be done in-memory, which is kinda tricky, but it's worth
> exploring.

Hi Bradford,

thank you for your interest. Just yesterday I found out that somebody else did apparently exactly the same as I did, porting Lucandra to HBase: http://github.com/akkumar/hbasene

I'll have a look at this project and most likely abandon LuceHBase in favor of the other, since it's more advanced.

Best regards, Thomas Koch, http://www.koch.ro
Re: Experience with indexing billions of documents?
Hey there,

We've actually been tackling this problem at Drawn to Scale. We'd really like to get our hands on LuceHBase to see how it scales. Our faceting still needs to be done in-memory, which is kinda tricky, but it's worth exploring.

On Mon, Apr 12, 2010 at 7:27 AM, Thomas Koch wrote:
> Hi,
>
> could I interest you in this project?
> http://github.com/thkoch2001/lucehbase
>
> The aim is to store the index directly in HBase, a database system
> modelled after Google's Bigtable, designed to store data on the order of
> terabytes or petabytes.
>
> Best regards, Thomas Koch
>
> Lance Norskog:
> > The 2B limitation is within one shard, due to using a signed 32-bit
> > integer. There is no limit in that regard in sharding: Distributed
> > Search uses the stored unique document id rather than the internal
> > docid.
> >
> > [...]
>
> Thomas Koch, http://www.koch.ro

--
Bradford Stephens, Founder, Drawn to Scale
drawntoscalehq.com
727.697.7528
http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution. Process, store, query, search, and serve all your data.
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
Re: Experience with indexing billions of documents?
Hi,

could I interest you in this project? http://github.com/thkoch2001/lucehbase

The aim is to store the index directly in HBase, a database system modelled after Google's Bigtable, designed to store data on the order of terabytes or petabytes.

Best regards, Thomas Koch

Lance Norskog:
> The 2B limitation is within one shard, due to using a signed 32-bit
> integer. There is no limit in that regard in sharding: Distributed
> Search uses the stored unique document id rather than the internal
> docid.
>
> On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens wrote:
> > A colleague of mine is using native Lucene + some home-grown
> > patches/optimizations to index over 13B small documents in a 32-shard
> > environment, which is around 406M docs per shard.
> >
> > If there's a 2B doc id limitation in Lucene then I assume he's patched it
> > himself.
> >
> > [...]

Thomas Koch, http://www.koch.ro
Re: Experience with indexing billions of documents?
The 2B limitation is within one shard, due to using a signed 32-bit integer. There is no limit in that regard in sharding: Distributed Search uses the stored unique document id rather than the internal docid.

On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens wrote:
> A colleague of mine is using native Lucene + some home-grown
> patches/optimizations to index over 13B small documents in a 32-shard
> environment, which is around 406M docs per shard.
>
> If there's a 2B doc id limitation in Lucene then I assume he's patched it
> himself.
>
> On Fri, Apr 2, 2010 at 1:17 PM, wrote:
> > My guess is that you will need to take advantage of Solr 1.5's upcoming
> > cloud/cluster renovations and use multiple indexes to comfortably achieve
> > those numbers. Hypothetically, in that case, you won't be limited by the
> > single-index docid limitations of Lucene.
> >
> > [...]

--
Lance Norskog
goks...@gmail.com
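[Editor's note: Lance's point reduces to back-of-the-envelope capacity math. The signed 32-bit internal docid caps a single shard just under 2^31 documents, so the shard count for a given corpus is a ceiling division. The 500M-per-shard target below is an illustrative assumption, not a recommendation:]

```python
# Lucene's internal docID is a signed 32-bit int, so one shard tops out
# just under 2**31 documents; sharding removes the corpus-wide limit.
MAX_DOCS_PER_SHARD = 2**31 - 1  # ~2.15 billion

def min_shards(total_docs: int, target_per_shard: int) -> int:
    """Shards needed to keep each shard at a comfortable target size."""
    if target_per_shard > MAX_DOCS_PER_SHARD:
        raise ValueError("target exceeds Lucene's per-shard docid limit")
    return -(-total_docs // target_per_shard)  # ceiling division

# Rich's numbers: 13B small docs at ~406M per shard works out to 32 shards.
print(min_shards(13_000_000_000, 406_250_000))  # 32
# Tom's worst case of 6B pages, at an assumed 500M docs per shard:
print(min_shards(6_000_000_000, 500_000_000))   # 12
```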
Re: Experience with indexing billions of documents?
A colleague of mine is using native Lucene + some home-grown patches/optimizations to index over 13B small documents in a 32-shard environment, which is around 406M docs per shard.

If there's a 2B doc id limitation in Lucene then I assume he's patched it himself.

On Fri, Apr 2, 2010 at 1:17 PM, wrote:
> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by the
> single-index docid limitations of Lucene.
>
> [...]
Re: Experience with indexing billions of documents?
You can do this today with multiple indexes, replication, and distributed searching. SolrCloud/clustering will certainly make life easier when it comes to managing these, but with distributed searches over multiple indexes you're limited only by how much hardware you can throw at it.

On Fri, Apr 2, 2010 at 6:17 PM, wrote:
> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by the
> single-index docid limitations of Lucene.
>
> [...]
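[Editor's note: with replication added to the mix, a client or load balancer picks one live replica per logical shard before issuing the distributed query. A toy sketch of that selection step; the shard layout and hostnames are invented, and two replicas per shard illustrates the 2x-servers-for-2x-replication cost Jason notes earlier in the thread:]

```python
import random

# Invented layout: two replicas per logical shard for redundancy.
replicas = {
    "shard0": ["solr1:8983/solr", "solr2:8983/solr"],
    "shard1": ["solr3:8983/solr", "solr4:8983/solr"],
}

def pick_shards(layout: dict) -> str:
    """Choose one replica per shard and build Solr's shards parameter."""
    return ",".join(random.choice(hosts) for hosts in layout.values())

shards_param = pick_shards(replicas)
print(shards_param)  # e.g. "solr2:8983/solr,solr3:8983/solr"
```

Losing a server then costs only that shard's spare replica, not query availability, which is the failover property the SolrCloud work aims to automate.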
Re: Experience with indexing billions of documents?
My guess is that you will need to take advantage of Solr 1.5's upcoming cloud/cluster renovations and use multiple indexes to comfortably achieve those numbers. Hypothetically, in that case, you won't be limited by the single-index docid limitations of Lucene.

> We are currently indexing 5 million books in Solr, scaling up over the
> next few years to 20 million. However, we are using the entire book as a
> Solr document. We are evaluating the possibility of indexing individual
> pages, as there are some use cases where users want the most relevant
> pages regardless of what book they occur in. However, we estimate that we
> are talking about somewhere between 1 and 6 billion pages and have
> concerns over whether Solr will scale to this level.
>
> Does anyone have experience using Solr with 1-6 billion Solr documents?
>
> The Lucene file format document
> (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> mentions a limit of about 2 billion document ids. I assume this is the
> Lucene internal document id and would therefore be a per-index/per-shard
> limit. Is this correct?
>
> Tom Burton-West.
Experience with indexing billions of documents?
We are currently indexing 5 million books in Solr, scaling up over the next few years to 20 million. However, we are using the entire book as a Solr document. We are evaluating the possibility of indexing individual pages, as there are some use cases where users want the most relevant pages regardless of what book they occur in. However, we estimate that we are talking about somewhere between 1 and 6 billion pages and have concerns over whether Solr will scale to this level.

Does anyone have experience using Solr with 1-6 billion Solr documents?

The Lucene file format document (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a limit of about 2 billion document ids. I assume this is the Lucene internal document id and would therefore be a per-index/per-shard limit. Is this correct?

Tom Burton-West.