Re: question about using lucene on large documents

2014-02-04 Thread Michael Sokolov
Ideally you would chunk a document at logical boundaries that make 
sense as units of both search and presentation.  For some content these 
boundaries don't align: you might want to search for matches within a 
paragraph, a section, a chapter, or a part of a book, while displaying 
results at a different level.  Often, though, books break down neatly 
into a sequence of more-or-less self-contained units (usually bigger 
than paragraphs: think chapters).


If you need to be concerned about overlapping scopes, I would create a 
nested container structure (like Russian dolls) so you can choose which 
level to search at and which to display, maintaining links between the 
documents so you can navigate or reassemble the whole later.  Don't be 
afraid of the inefficiency if you need it, but don't create it if you 
don't, because it will complicate your life.
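
To make the nested idea concrete, here is a minimal sketch in plain 
Python rather than Lucene code.  The Unit record, its field names 
(unit_id, level, parent_id, position) and the helper functions are all 
hypothetical; in a real index each Unit would presumably become one 
Document carrying those values as fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    """One indexable unit; all field names are illustrative assumptions."""
    unit_id: str
    level: str                 # "book", "chapter" or "paragraph"
    text: str
    parent_id: Optional[str]   # link to the enclosing unit
    position: int              # order among siblings, for reassembly


def build_units(book_id: str, chapters: List[List[str]]) -> List[Unit]:
    """Index the same text at three levels: paragraph, chapter, book."""
    units = []
    chapter_texts = []
    for c, paras in enumerate(chapters):
        chap_id = f"{book_id}/c{c}"
        for p, para in enumerate(paras):
            units.append(Unit(f"{chap_id}/p{p}", "paragraph", para, chap_id, p))
        chap_text = "\n\n".join(paras)
        units.append(Unit(chap_id, "chapter", chap_text, book_id, c))
        chapter_texts.append(chap_text)
    units.append(Unit(book_id, "book", "\n\n".join(chapter_texts), None, 0))
    return units


def search(units: List[Unit], level: str, term: str) -> List[str]:
    """Naive substring search restricted to one level of the hierarchy."""
    term = term.lower()
    return [u.unit_id for u in units
            if u.level == level and term in u.text.lower()]


def children_of(units: List[Unit], parent_id: str) -> List[str]:
    """Follow the parent links downwards to reassemble a unit in order."""
    kids = [u for u in units if u.parent_id == parent_id]
    return [u.unit_id for u in sorted(kids, key=lambda u: u.position)]
```

Note that the text is stored once per level, which is exactly the 
inefficiency mentioned above; the parent links are what let you move 
from a paragraph hit up to its chapter for display.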


Basically - there is no single right answer; it depends on the content 
and the use cases.


-Mike

On 2/4/2014 3:53 PM, mrodent wrote:

Hi,

This question may well be very familiar to experienced Lucene people... in
which case all I need is to be pointed somewhere.  I am new.

If you have a large document, e.g. a large Word file, and you want to split
it into text, e.g. by using Apache POI, what techniques are best used?

It seems to me that if you split it so that the text of each paragraph
becomes a Document (in the Lucene index sense), then each search will
obviously only be carried out within that paragraph... so maybe you
should split it into blocks of text, i.e. runs of paragraphs containing
no text-free (whitespace-only) paragraphs.  But what if those blocks are
too big as Documents, or too small?
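
One possible way to handle the "too big or too small" problem, sketched
in plain Python with no POI or Lucene calls.  The function names and the
min_words / max_words thresholds are assumptions made for illustration,
not anything Lucene prescribes.

```python
def group_paragraphs(paragraphs):
    """Group consecutive non-empty paragraphs into blocks.

    A whitespace-only paragraph ends the current block.
    """
    blocks, current = [], []
    for para in paragraphs:
        if para.strip():
            current.append(para.strip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def word_count(paras):
    return sum(len(p.split()) for p in paras)


def rebalance(blocks, min_words, max_words):
    """Merge undersized blocks forward and split oversized ones.

    Splits only happen at paragraph boundaries, so a single paragraph
    longer than max_words is kept whole as one oversized chunk.
    """
    out, pending = [], []
    for block in blocks:
        for para in block:
            # Flush before this paragraph would push the chunk over the cap.
            if pending and word_count(pending) + len(para.split()) > max_words:
                out.append(pending)
                pending = []
            pending.append(para)
        # At a block boundary, emit the chunk only if it is big enough;
        # otherwise carry it over and merge it with the next block.
        if word_count(pending) >= min_words:
            out.append(pending)
            pending = []
    if pending:
        out.append(pending)
    return out
```

Each resulting chunk (a list of paragraphs) would then become one
Document.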

It occurs to me that under some circumstances you might actually want
your Documents to be "overlapping"... i.e. the text at the end of one
Document is also the text at the beginning of the next Document... thus
making it less likely that a search will miss terms which are quite
close to one another.
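
The overlapping idea amounts to a sliding window.  A minimal sketch in
plain Python, where the function name and the size / overlap parameters
are assumptions chosen for illustration:

```python
def overlapping_chunks(items, size, overlap):
    """Split a list of paragraphs (or tokens) into overlapping windows.

    Each window holds up to `size` items and repeats the last `overlap`
    items of the previous window.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(items):
        chunks.append(items[start:start + size])
        if start + size >= len(items):
            break  # this window already reaches the end of the input
        start += step
    return chunks
```

With size=3 and overlap=1, for example, the last item of each chunk is
repeated as the first item of the next one.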

But surely this must be an inefficient way of storing index data (and all
the more so the text "content" itself)... because it is repetitious.

So then it makes me wonder whether the developers behind Lucene have made
provision for such circumstances... is there a way of making the presence
of a search term in Document N influence the ranking of Document N+1 (for
example if another search term is found in the latter)?  Or rather, the
two Documents should then be ranked together, as a pair.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-about-using-lucene-on-large-documents-tp4115343.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







Re: question about using lucene on large documents

2014-02-04 Thread mrodent
Thanks, gives me food for thought.  So no { N, N+1 } ideas specifically...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-about-using-lucene-on-large-documents-tp4115343p4115465.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: question about using lucene on large documents

2014-02-05 Thread Michael Sokolov
No, not really.  What would you do if you had a match contained entirely 
within the overlapping region? You'd probably need a way to distinguish 
that from a term that matched in two adjacent chunks, but *not* in the 
overlap.  Sounds very tricky to me.
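
To illustrate the bookkeeping this would take, here is a rough sketch in 
plain Python.  It assumes you record the global token offset at which 
each chunk starts, which Lucene does not do for you across Documents; 
the function names and return labels are made up for the example.

```python
def chunk_ranges(n_tokens, size, overlap):
    """Return the (start, end) token range of each overlapping chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    ranges = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        ranges.append((start, end))
        if end == n_tokens:
            break
        start += step
    return ranges


def chunks_containing(position, ranges):
    """Indices of every chunk whose range covers a global token position."""
    return {i for i, (s, e) in enumerate(ranges) if s <= position < e}


def classify_pair(pos_a, pos_b, ranges):
    """Classify where two matching terms fall relative to the chunks."""
    a = chunks_containing(pos_a, ranges)
    b = chunks_containing(pos_b, ranges)
    shared = a & b
    if len(shared) > 1:
        # Both terms sit in the overlap, so the same match is in two chunks.
        return "overlap"
    if len(shared) == 1:
        return "same chunk"
    if any(abs(i - j) == 1 for i in a for j in b):
        # One term per neighbouring chunk, never together in either one.
        return "adjacent chunks"
    return "far apart"
```

Even in this toy form you need global positions for every hit, which is 
part of why the approach looks tricky.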


-Mike

On 2/5/2014 2:21 AM, mrodent wrote:

Thanks, gives me food for thought.  So no { N, N+1 } ideas specifically...


