Yonik Seeley wrote:
On 4/4/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
I am not sure you need 509 when you have lazy loading.

It would be nice to avoid creating a Field object at all... we have
some crazy documents with more than 1000 fields :-)  I think the Field
object itself takes up more room than the data.

For my usecases, specifying which fields should be lazily loaded
doesn't work well...  I know which fields I want, not which ones I
don't.

true, true. Looking at the code, I don't think it is that hard to do. As 509 states, the main issue is that you still need to read in all the fields in a document.
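For illustration, here is a minimal sketch of what a "load only the fields I want" selector could look like. The names (`FieldSelector`, `SetBasedSelector`, the `Action` enum) are hypothetical, not an existing Lucene API; the point is just that a whitelist lets the reader skip fields it was never asked for.

```java
import java.util.Set;

// Hypothetical sketch (not an existing Lucene API): a selector that names
// the fields the caller wants, rather than the ones to skip.
interface FieldSelector {
    enum Action { LOAD, LAZY_LOAD, NO_LOAD }

    // Decide, per stored field, whether to load it eagerly,
    // load it lazily, or skip it entirely.
    Action accept(String fieldName);
}

// A whitelist implementation: everything not named is skipped, so reading
// a 1000-field document only materializes the fields asked for.
class SetBasedSelector implements FieldSelector {
    private final Set<String> wanted;

    SetBasedSelector(Set<String> wanted) {
        this.wanted = wanted;
    }

    public Action accept(String fieldName) {
        return wanted.contains(fieldName) ? Action.LOAD : Action.NO_LOAD;
    }
}
```

With something like this, a Field object would never even be created for the unwanted fields, which addresses the "Field object takes up more room than the data" concern above.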

Mark Harwood had an interesting post earlier on this same thread about some other possibilities for interfaces.


My use case is below (my guess is this is quite common).

Run a search, get back your hits and display summary information on the
hits (i.e. the "small" fields).  The user picks the hit they want to see
more info on, and you go display the full document.

It seems like the only way this can work is if you keep the index
searcher open and cache the Hits object that the user used.  How long
do you keep that searcher open waiting for the user to do something? I guess it could work as long as you have logic to re-execute the
query if the searcher changes...

Yeah, we aren't updating a lot, so we cache the searchers. If you followed the other thread I have going on the "Semantics of IndexInput...", Doug and I discussed that accessing the stream becomes undefined after the stream is closed. So, while loading does still work in some cases after a close, it isn't guaranteed, and any application would need to be able to handle that.

Displaying the full document most likely includes the
info in the really large stored fields (i.e. the original document).  To
date, I have been storing this info elsewhere b/c of the loading
penalty.  With lazy loading, I don't need to do this.  I can just defer
loading until the second level access is needed and I never load it if
the user doesn't ask for it.
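As a sketch of that deferral, assume a hypothetical `LazyField` wrapper (illustrative names, not the actual Lucene Field class): the large value is fetched once, on first access, and never fetched at all if the user doesn't drill down.

```java
import java.util.function.Supplier;

// Hypothetical lazy wrapper (not the real Lucene Field class):
// the large stored value is only read on first access.
class LazyField {
    private final String name;
    private String value;                   // null until first access
    private final Supplier<String> loader;  // e.g. reads from the index

    LazyField(String name, Supplier<String> loader) {
        this.name = name;
        this.loader = loader;
    }

    String name() { return name; }

    // First call pays the loading cost; later calls reuse the cached value.
    // Note: per the IndexInput discussion above, the loader's behavior is
    // undefined if the underlying stream has already been closed.
    String stringValue() {
        if (value == null) {
            value = loader.get();
        }
        return value;
    }
}
```

If the user never asks for the full document, `loader.get()` is never called and the large field is never read.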

Actually, for really large text fields, I can see that you wouldn't
want lucene to re-parse the fields anyway, so I agree that lazy
loading helps there.

In the case where you only get a few smaller fields, you have to go back
and get the document again when you want to display the contents of the
large field.

Of course, there are several other use cases where you may only want
certain fields, but I don't think there is much cost associated with
loading small fields, just the large ones, so you can just make them lazy.

Part of the cost is iterating through all the fields of the Document
looking for the one or two you want.


Yeah, I'm not sure there is a good solution to this. Maybe alter the file format so that you store all the meta info about a document's fields up front, followed by the field data; that would at least speed it up.

One of the things both SOLR and what we call IR Tools at CNLP (see my ApacheCon talk) do is provide better access to the metadata about fields, indexes, etc. It is hard, in Lucene, to know which fields belong to which documents and how they are indexed. You have to save this information in your application, even though most, if not all, of it is already in Lucene in some form.
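To make the "meta info up front" idea concrete, here is a toy layout — purely an assumption for illustration, not Lucene's actual stored-fields format: a directory of (name, offset, length) entries precedes the field data, so a reader scans only the small directory and then seeks straight to the one field it wants.

```java
import java.io.*;
import java.util.*;

// Toy stored-fields layout (an assumption for illustration, not the real
// Lucene format): [count][name, offset, length]* followed by the raw data.
class FieldDirectoryDemo {

    static byte[] write(LinkedHashMap<String, byte[]> fields) throws IOException {
        ByteArrayOutputStream dir = new ByteArrayOutputStream();
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        DataOutputStream d = new DataOutputStream(dir);
        d.writeInt(fields.size());
        int offset = 0;
        for (Map.Entry<String, byte[]> e : fields.entrySet()) {
            d.writeUTF(e.getKey());           // field name
            d.writeInt(offset);               // where its bytes start in the data section
            d.writeInt(e.getValue().length);  // how many bytes it occupies
            offset += e.getValue().length;
            data.write(e.getValue());
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(dir.toByteArray());
        out.write(data.toByteArray());
        return out.toByteArray();
    }

    // Fetch a single field: read only the small directory, then jump over
    // the other fields' data instead of parsing it.
    static byte[] readField(byte[] stored, String wanted) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored));
        int count = in.readInt();
        int offset = -1, length = -1;
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            int off = in.readInt(), len = in.readInt();
            if (name.equals(wanted)) { offset = off; length = len; }
        }
        if (offset < 0) return null;   // field not stored for this document
        in.skipBytes(offset);          // stream now sits at the data section
        byte[] value = new byte[length];
        in.readFully(value);
        return value;
    }
}
```

The cost of finding one field is then proportional to the directory size, not to the total bytes of all the stored fields.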

I will take a crack at this sometime later and see if I can implement some of the ideas we have discussed.

As I see it, we have a few goals:
1. Retrieve only the fields someone wants
2. Retrieve all fields, but leave some to be lazily loaded
3. Provide SQL-like functionality (as Mark suggested) [a bit harder and more involved?]

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484  Fax: 315-443-6886
