Yonik Seeley wrote:
On 4/4/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
I am not sure you need 509 when you have lazy loading.
It would be nice to avoid creating a Field object at all... we have
some crazy documents with more than 1000 fields :-) I think the Field
object itself takes up more room than the data.
For my use cases, specifying which fields should be lazily loaded
doesn't work well... I know which fields I want, not which ones I
don't.
True, true. In looking at the code, I don't think it is that hard to
do. As 509 states, the main issue is that you still need to read in all
the fields in a document.
Mark Harwood had an interesting post earlier on this same thread about
some other possibilities for interfaces.
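One possible shape for such an interface, driven by the "I know which fields I want" point above, might look like the following. This is a hypothetical sketch for illustration only: the `FieldSelector`, `Action`, and `SetBasedFieldSelector` names and their shape are invented here, not an existing Lucene API.

```java
// Hypothetical sketch of a field-selection callback in the spirit of the
// interfaces discussed in this thread; all names are invented.
import java.util.Set;

interface FieldSelector {
    enum Action { LOAD, LAZY_LOAD, NO_LOAD }
    // Decide, per field name, whether to load it eagerly, lazily, or not at all.
    Action accept(String fieldName);
}

class SetBasedFieldSelector implements FieldSelector {
    private final Set<String> toLoad;       // "I know which fields I want"
    private final Set<String> toLoadLazily; // large fields deferred until asked for

    SetBasedFieldSelector(Set<String> toLoad, Set<String> toLoadLazily) {
        this.toLoad = toLoad;
        this.toLoadLazily = toLoadLazily;
    }

    public Action accept(String fieldName) {
        if (toLoad.contains(fieldName)) return Action.LOAD;
        if (toLoadLazily.contains(fieldName)) return Action.LAZY_LOAD;
        return Action.NO_LOAD; // skip creating the Field object entirely
    }
}
```

A reader constructed with such a selector could skip the stored bytes of NO_LOAD fields outright, which addresses the "1000 fields" case where the Field objects cost more than the data.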
My use case is below (my guess is this is quite common):
run a search, get back your hits, and display summary information on the
hits (i.e. the "small" fields). The user picks the Hit they want to see
more info on, and you go display the full document.
It seems like the only way this can work is if you keep the index
searcher open and cache the Hits object that the user used. How long
do you keep that searcher open waiting for the user to do something?
I guess it could work as long as you have logic to re-execute the
query if the searcher changes...
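The "re-execute the query if the searcher changes" logic could be sketched roughly like this. All names here are invented for illustration; doc ids and the query are stand-ins for the real Lucene objects.

```java
// Sketch of caching hits against a particular searcher and re-executing the
// query when that searcher has been replaced. Names are invented.
import java.util.List;
import java.util.function.Supplier;

class CachedSearch {
    private Object searcherVersion; // identity of the searcher the hits came from
    private List<Integer> docIds;   // cached hit doc ids

    // Return the cached hits if they came from currentSearcher; otherwise
    // re-execute the query via rerun and cache the fresh results.
    List<Integer> hits(Object currentSearcher, Supplier<List<Integer>> rerun) {
        if (searcherVersion != currentSearcher) {
            docIds = rerun.get();
            searcherVersion = currentSearcher;
        }
        return docIds;
    }
}
```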
Yeah, we aren't updating a lot, so we cache the searchers. If you
followed the other thread I have going on the "Semantics of
IndexInput...", Doug and I discussed that accessing the stream becomes
undefined after the stream is closed. So while it does still work to
load in some cases, it isn't guaranteed, and any application would need
to be able to handle this.
The full document most likely includes the
info in the really large stored fields (i.e. the original document). To
date, I have been storing this info elsewhere because of the loading
penalty. With lazy loading, I don't need to do this: I can just defer
loading until the second-level access is needed, and I never load it if
the user doesn't ask for it.
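That deferred, load-on-first-access behavior could be sketched like this. The `LazyField` name and the `Supplier`-based loader are invented for illustration; in a real index the loader would seek into the stored-fields file.

```java
// Sketch of a lazily loaded stored field: the value is read from the index
// only on first access, and never if nobody asks. Names are invented.
import java.util.function.Supplier;

class LazyField {
    private final String name;
    private final Supplier<String> loader; // fetches the stored value on demand
    private String value;
    private boolean loaded;

    LazyField(String name, Supplier<String> loader) {
        this.name = name;
        this.loader = loader;
    }

    String name() { return name; }

    // The first call pays the loading cost; later calls return the cached value.
    String stringValue() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }
}
```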
Actually, for really large text fields, I can see that you wouldn't
want Lucene to re-parse the fields anyway, so I agree that lazy
loading helps there.
In the case where you only get a few smaller fields, you have to go back
and get the document again when you want to display the contents of the
large field.
Of course, there are several other use cases where you may only want
certain fields, but I don't think there is much cost associated with
loading small fields, only the large ones, so you can just make those lazy.
Part of the cost is iterating through all the fields of the Document
looking for the one or two you want.
Yeah, not sure if there is a good solution to this. Maybe altering the
file formats such that you store all the meta info about a field up
front, and then the field data, somehow. This would at least speed it
up. One of the things I think both SOLR and what we call IR Tools at
CNLP (see my ApacheCon talk) do is provide better access to the
metadata about fields/indexes, etc. It is hard, in Lucene, to know which
fields belong to which documents and how they are indexed. You must save
this information in your application, even though most, if not all, of
it is already in Lucene in some form.
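The "meta info up front" idea could look roughly like this: a per-document header mapping each field name to the byte offset of its data, so a reader can seek straight to the one or two fields it wants instead of scanning every field in order. The layout and the `FieldTable` name are invented here, not the actual Lucene file format.

```java
// Sketch of a header that stores field metadata up front: numFields pairs of
// (field name, data offset). Format and names are invented for illustration.
import java.io.DataInput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldTable {
    // Read the header; the caller can then seek to offsets.get(wantedField)
    // and read just that field's data, skipping all the others.
    static Map<String, Long> readHeader(DataInput in, int numFields) throws IOException {
        Map<String, Long> offsets = new LinkedHashMap<>();
        for (int i = 0; i < numFields; i++) {
            String name = in.readUTF();
            long offset = in.readLong();
            offsets.put(name, offset);
        }
        return offsets;
    }
}
```

Because the header also lists which fields a document actually has, it would answer the "which fields belong to which documents" question without the application tracking it separately.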
I will take a crack at this sometime later and see if I can implement
some of the ideas we have discussed.
As I see it, we have a few goals:
1. Retrieve only the fields someone wants
2. Retrieve all fields, but leave some to be lazily loaded
3. Provide SQL-like functionality (as Mark suggested) [a bit harder and
more involved?]
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886