Hi,

Sorry about the delay, I was away.

On 12/19/07, Mikkel Kamstrup Erlandsen <[EMAIL PROTECTED]> wrote:
> From what I understood you needed to build Documents because you
> wanted to do post sorting on some of the fields. Lucene's
> Searcher.search(Query q, Sort s) is really fast and the sorter has
> access to non-tokenized fields without creating the Documents (even
> non-stored ones). This requires storing the mtime as an integer or
> long though.

Searcher.search(Query q, Sort s) only sorts a subset of the documents returned from a search. By default, 100. The sort doesn't apply over the entire result space. Things tend to "just work" however because the field sorters assert that the documents are indexed in order. That is, if you're sorting by a timestamp field, you have to index the oldest document first and the newest last. Beagle doesn't work that way -- it indexes files as it comes across them -- and it would be prohibitive to try to do otherwise.

Because we have to search two Lucene indexes for one set of results and because we potentially have to walk the entire result space, we use a much lower level API than the one which returns Hits collections. It seems from the Lucene mailing lists that use of the Hits API is largely discouraged in most non-trivial search applications.

> Also assuming that you don't have more than a few stored fields it
> should still be fairly fast to create the Documents via Hits.doc(int
> i) since it only adds the stored fields to the doc.

All of our metadata is stored fields. Timestamp, MIME type, file name, email subject line, etc. So there are a fair number of stored fields for each document. Remember, the penalty here is disk seek time, not the amount of data pulled off disk.

> One hack we use at work is to encode the needed field data in one
> stored field and then parse that blob for each hit and using the data
> for display.

Yeah, this may be something we could do for Beagle, but newer Lucenes allow us to pull stuff on demand.
That's probably the biggest gain we can get.

> What does "pull" mean exactly in the case in point? Just calling
> Hits.doc(i) or is it a full rebuilding of the Document as it was added
> to the index? I guess I've read it as more than doing a Hits.doc() at
> least...

Calling IndexSearcher.Doc(doc_id), which results in a Document object.

Joe
_______________________________________________
Xesam mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/xesam
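
The approach described in this message (walk every hit from both indexes, sort across the whole result space, and fetch stored fields only for the hits that will actually be shown) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Beagle's or Lucene's code: `ToyIndex`, `match_all`, `load_document`, and `newest_hits` are hypothetical names, and the `loads` counter merely stands in for the disk seeks that make building Documents expensive.

```python
import heapq


class ToyIndex:
    """Hypothetical stand-in for one index; not a Lucene class."""

    def __init__(self, docs):
        # docs maps doc_id -> (mtime, stored_fields)
        self._docs = docs
        self.loads = 0  # number of expensive stored-field fetches

    def match_all(self):
        """Yield (doc_id, mtime) for every hit, in index order."""
        for doc_id, (mtime, _fields) in self._docs.items():
            yield doc_id, mtime

    def load_document(self, doc_id):
        """Model the costly step: fetching a document's stored fields."""
        self.loads += 1
        return dict(self._docs[doc_id][1])


def newest_hits(indexes, limit):
    """Sort over every hit in every index, then load only the top `limit`."""
    candidates = [
        (mtime, pos, doc_id)
        for pos, index in enumerate(indexes)
        for doc_id, mtime in index.match_all()
    ]
    # Newest first; only integer keys are compared, no documents are built.
    top = heapq.nlargest(limit, candidates)
    return [indexes[pos].load_document(doc_id) for _mtime, pos, doc_id in top]


# Documents are deliberately NOT indexed in timestamp order.
primary = ToyIndex({0: (300, {"name": "c.txt"}), 1: (100, {"name": "a.txt"})})
secondary = ToyIndex({0: (200, {"name": "b.txt"}), 1: (400, {"name": "d.txt"})})

for hit in newest_hits([primary, secondary], limit=2):
    print(hit["name"])
```

The point of the design is that the sort touches only cheap integer keys for the full result space, while the number of stored-field fetches is bounded by `limit` rather than by the number of hits.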
