Re: 400 MB Fields

2011-06-08 Thread Alexander Kanarsky
Otis,

Not sure about Solr, but with Lucene it was certainly doable. I have
seen fields much bigger than 400 MB indexed, sometimes with a large set
of unique terms as well (think of a log file full of alphanumeric
tokens, a couple of gigabytes in size). When indexing and querying such
fields, I/O can, naturally, easily become the bottleneck.

-Alexander


400 MB Fields

2011-06-07 Thread Otis Gospodnetic
Hello,

What are the biggest document fields that you've ever indexed in Solr, or that 
you've heard of?  Ah, it must be Tom's HathiTrust. :)

I'm asking because I just heard of a case of an index where some documents 
have a field that can be around 400 MB in size!  I'm curious whether anyone 
has any experience with such monster fields?
Crazy?  Yes, sure.
Doable?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: 400 MB Fields

2011-06-07 Thread Erick Erickson
From older (2.4) Lucene days, I once indexed the 23-volume Encyclopedia
of Michigan Civil War Volunteers in a single document/field, so it's probably
within the realm of possibility, at least <G>...

Erick

On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello,

 What are the biggest document fields that you've ever indexed in Solr or that
 you've heard of?  Ah, it must be Tom's Hathi trust. :)

 I'm asking because I just heard of a case of an index where some documents
 having a field that can be around 400 MB in size!  I'm curious if anyone has 
 any
 experience with such monster fields?
 Crazy?  Yes, sure.
 Doable?

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/




Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
I think the question is strange... Maybe you are wondering about possible
OOM exceptions? I think we can pass Lucene a single document containing a
comma-separated list of term, term, ... (a few billion times)... except for
stored fields and the TermVectorComponent...

I believe thousands of companies have already indexed millions of documents
with an average size of a few hundred megabytes... There should not be any
limits (except InputSource vs. ByteArray).
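
To illustrate that InputSource-style streaming, here is a minimal Lucene 3.x
sketch: the Field(name, Reader) form lets the analyzer pull the huge value
through a Reader instead of holding it as one giant String (the paths, field
name and 256 MB buffer are only made-up examples):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BigFieldIndexer {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/tmp/bigfield-index"));
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_32,
        new StandardAnalyzer(Version.LUCENE_32));
    cfg.setRAMBufferSizeMB(256);  // flush segments well before the heap fills up
    IndexWriter writer = new IndexWriter(dir, cfg);

    Document doc = new Document();
    // Field(name, Reader) is tokenized and indexed but never stored, so the
    // huge value is streamed through the analyzer rather than buffered whole.
    doc.add(new Field("body",
        new BufferedReader(new FileReader("/data/huge-400mb.txt"))));
    writer.addDocument(doc);
    writer.close();
  }
}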

100,000 _unique_ terms vs. a single document containing 100,000,000,000,000
non-unique terms (and trying to store offsets)...

What about the spell checker feature? Has anyone tried to index a single
terabyte-sized document?

Personally, I have indexed only small (up to 1000 bytes) document fields, but
I believe 500 MB is a very common use case with PDFs (which vendors use
Lucene already? Eclipse, to index the Eclipse Help files? Even Microsoft uses
Lucene...)


Fuad




On 11-06-07 7:02 PM, Erick Erickson erickerick...@gmail.com wrote:

From older (2.4) Lucene days, I once indexed the 23 volume Encyclopedia
of Michigan Civil War Volunteers in a single document/field, so it's
probably
within the realm of possibility at least G...

Erick

On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello,

 What are the biggest document fields that you've ever indexed in Solr
or that
 you've heard of?  Ah, it must be Tom's Hathi trust. :)

 I'm asking because I just heard of a case of an index where some
documents
 having a field that can be around 400 MB in size!  I'm curious if
anyone has any
 experience with such monster fields?
 Crazy?  Yes, sure.
 Doable?

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/






Re: 400 MB Fields

2011-06-07 Thread Otis Gospodnetic
Hi,


 I think the question is strange... May be you are wondering about possible
 OOM exceptions?

No, that's an easier one. I was more wondering whether with 400 MB Fields 
(indexed, not stored) it becomes incredibly slow to:
* analyze
* commit / write to disk
* search

 I think we can pass to Lucene single document containing
 comma separated list of term, term, ... (few billion times)... Except
 stored and TermVectorComponent...

Oh, I know it can be done, but I'm wondering how bad things (like the ones 
above) get.
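
For the "analyze" part alone, a rough timing sketch like this (Lucene 3.x,
with a made-up path to a 400 MB plain-text file) would at least put a number
on it:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzeTimer {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_32);
    long start = System.currentTimeMillis();
    // Pull the whole field through the analyzer without indexing anything,
    // just to see how long tokenization alone takes.
    TokenStream ts = analyzer.tokenStream("body",
        new BufferedReader(new FileReader("/data/huge-400mb.txt")));
    ts.reset();
    long tokens = 0;
    while (ts.incrementToken()) {
      tokens++;
    }
    ts.end();
    ts.close();
    System.out.println(tokens + " tokens in "
        + (System.currentTimeMillis() - start) + " ms");
  }
}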

 I believe thousands companies already indexed millions documents with
 average size few hundreds Mbytes... There should not be any limits (except

Which ones are you thinking about?  What sort of documents?

 100,000 _unique_ terms vs. single document containing 100,000,000,000,000
 of non-unique terms (and trying to store offsets)
 
 Personally, I indexed only small (up to 1000 bytes) documents-fields, but
 I believe 500Mb is very common use case with PDFs (which vendors use

Nah, PDF files may be big, but I think the text in them is often not *that* 
big, 
unless those are PDFs of very big books.

Thanks,
Otis


 On 11-06-07 7:02 PM, Erick Erickson erickerick...@gmail.com wrote:
 
 From older (2.4) Lucene days, I once indexed the 23 volume Encyclopedia
 of Michigan Civil War Volunteers in a single document/field, so it's
 probably
 within the realm of possibility at least G...
 
 Erick
 
 On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
  Hello,
 
  What are the biggest document fields that you've ever indexed in Solr
 or that
  you've heard of?  Ah, it must be Tom's Hathi trust. :)
 
  I'm asking because I just heard of a case of an index where some
 documents
  having a field that can be around 400 MB in size!  I'm curious if
 anyone has any
  experience with such monster fields?
  Crazy?  Yes, sure.
  Doable?
 
  Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/



Re: 400 MB Fields

2011-06-07 Thread Fuad Efendi
Hi Otis,


I am recalling the pagination issue; it is still unresolved (with the default
scoring implementation): even with small documents, retrieving results 1 to 10
can take close to 0 milliseconds, but retrieving results 100,000 to 100,010
can take a few minutes (I saw this with the trunk version 6 months ago, with
very small documents and about 100 million docs in total); it is advisable to
restrict search results to the top 1,000 in any case (as Google does)...
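
For example, a rough SolrJ sketch of capping the paging depth (the URL, the
query and the 1,000 cutoff are only examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CappedPaging {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    int requestedStart = Integer.parseInt(args[0]);

    SolrQuery q = new SolrQuery("body:lucene");
    q.setStart(Math.min(requestedStart, 1000));  // never page past the top 1,000
    q.setRows(10);

    QueryResponse rsp = solr.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}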



I believe things can go wrong; yes, most plain text extracted from books
should be about 2 KB per page, so 500 pages is roughly 1,000,000 bytes (or
double that for UTF-8).

Theoretically, it doesn't make any sense to index a BIG document containing
all the terms from a dictionary without any term-frequency calculations, but
even with them... I can't imagine we should index thousands of docs where each
one is just a (different) version of the whole of Wikipedia; that would be the
wrong design...

OK, use case: index a single HUGE document. What would we do? Create an index
with _the_only_ document? Then every search would return the same result (or
nothing)? Paginate it; split it into pages. I am pragmatic...
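
A rough sketch of that "split it into pages" design, assuming Lucene 3.x, an
already-open IndexWriter, and arbitrary field names and page size:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PageSplitter {
  // Index one huge text as many small page-level documents instead of a
  // single monster field; a real splitter would break on page markers or
  // whitespace rather than raw character counts.
  static void indexAsPages(IndexWriter writer, String bookId, String path)
      throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(path));
    char[] page = new char[2048];  // roughly 2 KB of text per "page"
    int pageNo = 0;
    int read;
    while ((read = in.read(page)) != -1) {
      Document doc = new Document();
      doc.add(new Field("book_id", bookId,
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("page", Integer.toString(pageNo++),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("body", new String(page, 0, read),
          Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    in.close();
  }
}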


Fuad



On 11-06-07 8:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi,


 I think the question is strange... May be you are wondering about
possible
 OOM exceptions? 

No, that's an easier one. I was more wondering whether with 400 MB Fields
(indexed, not stored) it becomes incredibly slow to:
* analyze
* commit / write to disk
* search

 I think we can pass to Lucene single document containing
 comma separated list of term, term, ... (few billion times)... Except
 stored and TermVectorComponent...




Re: 400 MB Fields

2011-06-07 Thread Lance Norskog
The Salesforce book is 2800 pages of PDF, last I looked.

What can you do with a field that big? Can you get all of the snippets?
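
If memory serves, the contrib Highlighter only scans roughly the first 50 KB
of text by default, so snippets deep inside a 400 MB field would simply be
missed. A rough sketch of raising that limit (field name and fragment count
are arbitrary, and the raw text has to come from outside the index if the
field is indexed but not stored):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.Version;

public class BigFieldSnippets {
  // 'query' is the user's parsed query; 'text' is the original field value.
  static String[] snippets(Query query, String text) throws Exception {
    Highlighter hl = new Highlighter(new SimpleHTMLFormatter(),
        new QueryScorer(query));
    // Without this, only the beginning of the huge field is considered.
    hl.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
    return hl.getBestFragments(new StandardAnalyzer(Version.LUCENE_32),
        "body", text, 5);
  }
}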

On Tue, Jun 7, 2011 at 5:33 PM, Fuad Efendi f...@efendi.ca wrote:
 Hi Otis,


 I am recalling pagination feature, it is still unresolved (with default
 scoring implementation): even with small documents, searching-retrieving
 documents 1 to 10 can take 0 milliseconds, but from 100,000 to 100,010 can
 take few minutes (I saw it with trunk version 6 months ago, and with very
 small documents, total 100 mlns docs); it is advisable to restrict search
 results to top-1000 in any case (as with Google)...



 I believe things can get wrong; yes, most plain-text retrieved from books
 should be 2kb per page, 500 pages, := 1,000,000 bytes (or double it for
 UTF-8)

 Theoretically, it doesn't make any sense to index BIG document containing
 all terms from dictionary without any terms frequency calcs, but even
 with it... I can't imagine we should index 1000s docs and each is just
 (different) version of whole Wikipedia, should be wrong design...

 Ok, use case: index single HUGE document. What will we do? Create index
 with _the_only_ document? And all search will return the same result (or
 nothing)? Paginate it; split into pages. I am pragmatic...


 Fuad



 On 11-06-07 8:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi,


 I think the question is strange... May be you are wondering about
possible
 OOM exceptions?

No, that's an easier one. I was more wondering whether with 400 MB Fields
(indexed, not stored) it becomes incredibly slow to:
* analyze
* commit / write to disk
* search

 I think we can pass to Lucene single document containing
 comma separated list of term, term, ... (few billion times)... Except
 stored and TermVectorComponent...






-- 
Lance Norskog
goks...@gmail.com


RE: 400 MB Fields

2011-06-07 Thread Burton-West, Tom
Hi Otis, 

Our OCR fields average around 800 KB.  My guess is that the largest docs we 
index (in a single OCR field) are somewhere between 2 and 10 MB.  We have had 
issues where the in-memory representation of the document (the in-memory index 
structures being built) is several times the size of the text, so I would 
suspect that even with the largest ramBufferSizeMB you might run into problems.  
(This is with the 3.x branch.  Trunk might not have this problem since it's 
much more memory-efficient when indexing.)
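
A rough way to see that blow-up is to compare the writer's in-memory index
size against the raw text right after adding the document (the sketch assumes
Lucene 3.x and an already-open IndexWriter; the field name is arbitrary):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class RamWatcher {
  // Add one big document, then compare the writer's pre-flush RAM usage
  // with the size of the raw text that went in.
  static void addAndMeasure(IndexWriter writer, String text) throws Exception {
    Document doc = new Document();
    doc.add(new Field("ocr", text, Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    System.out.println("text chars : " + text.length());
    System.out.println("writer RAM : " + writer.ramSizeInBytes() + " bytes");
    writer.commit();
  }
}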

Tom Burton-West
www.hathitrust.org/blogs

From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
Sent: Tuesday, June 07, 2011 6:59 PM
To: solr-user@lucene.apache.org
Subject: 400 MB Fields

Hello,

What are the biggest document fields that you've ever indexed in Solr or that
you've heard of?  Ah, it must be Tom's Hathi trust. :)

I'm asking because I just heard of a case of an index where some documents
having a field that can be around 400 MB in size!  I'm curious if anyone has any
experience with such monster fields?
Crazy?  Yes, sure.
Doable?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/