Re: 400 MB Fields
Otis,

Not sure about Solr, but with Lucene it is certainly doable. I have seen fields far bigger than 400 MB indexed, sometimes with a large set of unique terms as well (think something like a log file full of alphanumeric tokens, a couple of gigabytes in size). While indexing and querying such fields, I/O naturally could easily become the bottleneck.

-Alexander
400 MB Fields
Hello,

What are the biggest document fields that you've ever indexed in Solr, or that you've heard of? Ah, it must be Tom's HathiTrust. :)

I'm asking because I just heard of a case of an index where some documents have a field that can be around 400 MB in size! I'm curious whether anyone has any experience with such monster fields. Crazy? Yes, sure. Doable?

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: 400 MB Fields
From older (2.4) Lucene days, I once indexed the 23-volume Encyclopedia of Michigan Civil War Volunteers in a single document/field, so it's probably within the realm of possibility at least <g>...

Erick
Re: 400 MB Fields
I think the question is strange... Maybe you are wondering about possible OOM exceptions?

I think we can pass Lucene a single document containing a comma-separated list of term, term, ... (a few billion times)... except for stored fields and the TermVectorComponent... I believe thousands of companies have already indexed millions of documents with an average size of a few hundred megabytes... There should not be any limits (other than InputSource vs. ByteArray).

100,000 _unique_ terms vs. a single document containing 100,000,000,000,000 non-unique terms (and trying to store offsets)... What about the spell checker feature? Has anyone tried to index a single terabyte-sized document?

Personally, I have indexed only small (up to 1000 bytes) document fields, but I believe 500 MB is a very common use case with PDFs (which vendors use Lucene already? Eclipse, to index the Eclipse help files? Even Microsoft uses Lucene...)

Fuad
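To make that concrete, here is a rough, untested sketch of feeding Lucene a huge field through a Reader, so the text streams through the analyzer instead of sitting in memory as one giant String. The paths, field name, and RAM buffer size are invented for illustration (3.x-era API):

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class BigFieldIndexer {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/big-field-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_32, new StandardAnalyzer(Version.LUCENE_32));
            cfg.setRAMBufferSizeMB(256);  // flush segments before the heap fills up
            IndexWriter writer = new IndexWriter(dir, cfg);

            Document doc = new Document();
            // Reader-based field: tokenized and indexed, never stored, no term vectors;
            // the hundreds of MB of text stream through the analyzer rather than being
            // held as a single String.
            doc.add(new Field("body", new FileReader(new File("/tmp/monster-field.txt"))));
            writer.addDocument(doc);

            writer.close();
        }
    }

A Reader-based Field is tokenized and indexed but never stored and carries no term vectors, which lines up with the "except stored and TermVectorComponent" caveat above.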
Re: 400 MB Fields
Hi,

> I think the question is strange... Maybe you are wondering about possible OOM exceptions?

No, that's an easier one. I was more wondering whether with 400 MB fields (indexed, not stored) it becomes incredibly slow to:
* analyze
* commit / write to disk
* search

> I think we can pass Lucene a single document containing a comma-separated list of term, term, ... (a few billion times)... except for stored fields and the TermVectorComponent...

Oh, I know it can be done, but I'm wondering how bad things (like the ones above) get.

> I believe thousands of companies have already indexed millions of documents with an average size of a few hundred megabytes... There should not be any limits (except

Which ones are you thinking about? What sort of documents?

> 100,000 _unique_ terms vs. a single document containing 100,000,000,000,000 non-unique terms (and trying to store offsets)
>
> Personally, I have indexed only small (up to 1000 bytes) document fields, but I believe 500 MB is a very common use case with PDFs (which vendors use

Nah, PDF files may be big, but I think the text in them is often not *that* big, unless those are PDFs of very big books.

Thanks,
Otis
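For what it's worth, the numbers I'm after could come from something as simple as the hypothetical harness below; the paths are invented, and it only separates the analyze/index time from the commit time:

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MonsterFieldTimer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/monster-index")),
                    new IndexWriterConfig(Version.LUCENE_32,
                            new StandardAnalyzer(Version.LUCENE_32)));

            Document doc = new Document();
            doc.add(new Field("body", new FileReader(new File("/tmp/400mb.txt"))));

            long t0 = System.currentTimeMillis();
            writer.addDocument(doc);   // analysis + in-memory indexing (plus any intermediate flushes)
            long t1 = System.currentTimeMillis();
            writer.commit();           // flush / write to disk
            long t2 = System.currentTimeMillis();
            writer.close();

            System.out.println("analyze+index: " + (t1 - t0) + " ms, commit: " + (t2 - t1) + " ms");
        }
    }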
Re: 400 MB Fields
Hi Otis,

I am recalling the pagination issue, which is still unresolved (with the default scoring implementation): even with small documents, searching and retrieving documents 1 to 10 can take 0 milliseconds, but retrieving 100,000 to 100,010 can take a few minutes (I saw it with the trunk version 6 months ago, with very small documents and about 100 million docs in total); it is advisable to restrict search results to the top 1000 in any case (as Google does)...

I believe things can go wrong; yes, most plain text retrieved from books should be about 2 KB per page, so 500 pages is roughly 1,000,000 bytes (or double that for UTF-8).

Theoretically, it doesn't make any sense to index a BIG document containing all the terms from a dictionary without any term frequency calcs, but even with them... I can't imagine we should index thousands of docs where each is just a (different) version of the whole Wikipedia; that would be the wrong design...

OK, use case: index a single HUGE document. What will we do? Create an index with _the_only_ document? Then every search would return the same result (or nothing)? Paginate it; split it into pages. I am pragmatic...

Fuad
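As a rough illustration of the split-into-pages idea, a hypothetical SolrJ 3.x sketch; the core URL, field names, and the loadPages() helper are all invented:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PageSplitter {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // One Solr document per page instead of one 400 MB field per book.
            String[] pages = loadPages("/tmp/huge-book.txt");
            for (int i = 0; i < pages.length; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "huge-book-p" + i);
                doc.addField("book_id", "huge-book"); // lets the app group pages back into books
                doc.addField("page", i);
                doc.addField("body", pages[i]);       // a few KB each instead of hundreds of MB
                solr.add(doc);
            }
            solr.commit();
        }

        // Stub so the sketch compiles; real page extraction depends on the source format.
        private static String[] loadPages(String path) {
            return new String[0];
        }
    }

Each page document stays small, and the book_id field lets the application collapse page hits back into book hits if it needs to.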
Re: 400 MB Fields
The Salesforce book is 2800 pages of PDF, last I looked. What can you do with a field that big? Can you get all of the snippets?

--
Lance Norskog
goks...@gmail.com
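As a made-up example of what I mean (note the field would also need to be stored for highlighting to work at all, and by default the highlighter only inspects the first hl.maxAnalyzedChars characters, around 51200, so almost none of a 400 MB field would ever yield a snippet unless that limit is raised):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SnippetCheck {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("civil war volunteers");
            q.setHighlight(true);
            q.set("hl.fl", "body");
            q.set("hl.snippets", "10");
            // Raising this means more of the monster field is scanned for snippets,
            // at a corresponding cost in time and memory per hit.
            q.set("hl.maxAnalyzedChars", "1000000");

            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getHighlighting());
        }
    }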
RE: 400 MB Fields
Hi Otis,

Our OCR fields average around 800 KB. My guess is that the largest docs we index (in a single OCR field) are somewhere between 2 and 10 MB.

We have had issues where the in-memory representation of the document (the in-memory index structures being built) is several times the size of the text, so I suspect that even with the largest ramBufferSizeMB you might run into problems. (This is with the 3.x branch. Trunk might not have this problem since it's much more memory efficient when indexing.)

Tom Burton-West
www.hathitrust.org/blogs
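As a purely back-of-the-envelope illustration (the multiplier is a guess, not a measurement): if the in-memory structures run, say, 3 to 5 times the raw text, a single 400 MB field could tie up roughly 1.2 to 2 GB of indexing buffer on its own, which is why a very large ramBufferSizeMB alone may not save you.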