Yonik Seeley wrote:
On 4/4/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
I am not sure you need 509 when you have lazy loading.
It would be nice to avoid creating a Field object at all... we have
some crazy documents with more than 1000 fields :-) I think the Field
object itself takes up more room than the data.
For my use cases, specifying which fields should be lazily loaded
doesn't work well... I know which fields I want, not which ones I
don't.
True, true. In looking at the code, I don't think it is that hard to
do. As 509 states, the main issue is that you still need to read in all
the fields in a document.
Mark Harwood had an interesting post earlier on this same thread about
some other possibilities for interfaces.
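One possible shape for such an interface, driven by the "I know which fields I want" point above, might look like the following. This is a hypothetical sketch for illustration only: the `FieldSelector`, `Action`, and `SetBasedFieldSelector` names and their shape are invented here, not an existing Lucene API.

```java
// Hypothetical sketch of a field-selection callback in the spirit of the
// interfaces discussed in this thread; all names are invented.
import java.util.Set;

interface FieldSelector {
    enum Action { LOAD, LAZY_LOAD, NO_LOAD }
    // Decide, per field name, whether to load it eagerly, lazily, or not at all.
    Action accept(String fieldName);
}

class SetBasedFieldSelector implements FieldSelector {
    private final Set<String> toLoad;       // "I know which fields I want"
    private final Set<String> toLoadLazily; // large fields deferred until asked for

    SetBasedFieldSelector(Set<String> toLoad, Set<String> toLoadLazily) {
        this.toLoad = toLoad;
        this.toLoadLazily = toLoadLazily;
    }

    public Action accept(String fieldName) {
        if (toLoad.contains(fieldName)) return Action.LOAD;
        if (toLoadLazily.contains(fieldName)) return Action.LAZY_LOAD;
        return Action.NO_LOAD; // skip creating the Field object entirely
    }
}
```

A reader constructed with such a selector could skip the stored bytes of NO_LOAD fields outright, which addresses the "1000 fields" case where the Field objects cost more than the data.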
My use case is below (my guess is this is quite common):
run a search, get back your hits, and display summary information on the
hits (i.e. the "small" fields). The user picks the Hit they want to see
more info on, and you go display the full document.
It seems like the only way this can work is if you keep the index
searcher open and cache the Hits object that the user used. How long
do you keep that searcher open waiting for the user to do something?
I guess it could work as long as you have logic to re-execute the
query if the searcher changes...
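The "re-execute the query if the searcher changes" logic could be sketched roughly like this. All names here are invented for illustration; doc ids and the query are stand-ins for the real Lucene objects.

```java
// Sketch of caching hits against a particular searcher and re-executing the
// query when that searcher has been replaced. Names are invented.
import java.util.List;
import java.util.function.Supplier;

class CachedSearch {
    private Object searcherVersion; // identity of the searcher the hits came from
    private List<Integer> docIds;   // cached hit doc ids

    // Return the cached hits if they came from currentSearcher; otherwise
    // re-execute the query via rerun and cache the fresh results.
    List<Integer> hits(Object currentSearcher, Supplier<List<Integer>> rerun) {
        if (searcherVersion != currentSearcher) {
            docIds = rerun.get();
            searcherVersion = currentSearcher;
        }
        return docIds;
    }
}
```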
Yeah, we aren't updating a lot, so we cache the searchers. If you
followed the other thread I have going on the "Semantics of
IndexInput...", Doug and I discussed that accessing the stream becomes
undefined after the stream is closed. So while it does still work to
load in some cases, it isn't guaranteed, and any application would need
to be able to handle this.
The full document most likely includes the
info in the really large stored fields (i.e. the original document). To
date, I have been storing this info elsewhere because of the loading
penalty. With lazy loading, I don't need to do this: I can just defer
loading until the second-level access is needed, and I never load it if
the user doesn't ask for it.
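That deferred, load-on-first-access behavior could be sketched like this. The `LazyField` name and the `Supplier`-based loader are invented for illustration; in a real index the loader would seek into the stored-fields file.

```java
// Sketch of a lazily loaded stored field: the value is read from the index
// only on first access, and never if nobody asks. Names are invented.
import java.util.function.Supplier;

class LazyField {
    private final String name;
    private final Supplier<String> loader; // fetches the stored value on demand
    private String value;
    private boolean loaded;

    LazyField(String name, Supplier<String> loader) {
        this.name = name;
        this.loader = loader;
    }

    String name() { return name; }

    // The first call pays the loading cost; later calls return the cached value.
    String stringValue() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }
}
```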
Actually, for really large text fields, I can see that you wouldn't
want Lucene to re-parse the fields anyway, so I agree that lazy
loading helps there.
In the case where you only get a few smaller fields, you have to go back
and get the document again when you want to display the contents of the
large field.
Of course, there are several other use cases where you may only want
certain fields, but I don't think there is much cost associated with
loading small fields, only the large ones, so you can just make those lazy.
Part of the cost is iterating through all the fields of the Document
looking for the one or two you want.
Yeah, not sure if there is a good solution to this. Maybe altering the
file formats such that you store all the meta info about a field up
front, and then the field data, somehow. This would at least speed it
up. One of the things I think both SOLR and what we call IR Tools at
CNLP (see my ApacheCon talk) do is provide better access to the
metadata about fields/indexes, etc. It is hard, in Lucene, to know which
fields belong to which documents and how they are indexed. You must save
this information in your application, even though most, if not all, of
it is already in Lucene in some form.
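The "meta info up front" idea could look roughly like this: a per-document header mapping each field name to the byte offset of its data, so a reader can seek straight to the one or two fields it wants instead of scanning every field in order. The layout and the `FieldTable` name are invented here, not the actual Lucene file format.

```java
// Sketch of a header that stores field metadata up front: numFields pairs of
// (field name, data offset). Format and names are invented for illustration.
import java.io.DataInput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldTable {
    // Read the header; the caller can then seek to offsets.get(wantedField)
    // and read just that field's data, skipping all the others.
    static Map<String, Long> readHeader(DataInput in, int numFields) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<>();
        for (int i = 0; i < numFields; i++) {
            String name = in.readUTF();
            long offset = in.readLong();
            offsets.put(name, offset);
        }
        return offsets;
    }
}
```

Because the header also lists which fields a document actually has, it would answer the "which fields belong to which documents" question without the application tracking it separately.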
I will take a crack at this sometime later and see if I can implement
some of the ideas we have discussed.
As I see it, we have a few goals:
1. Retrieve only the fields someone wants
2. Retrieve all fields, but leave some to be lazily loaded
3. Provide SQL-like functionality (as Mark suggested) [a bit harder and
more involved?]
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886