Re: Document numbers and ids
On Sunday 06 February 2005 20:00, Chris Hostetter wrote:
> : > care about their content. I only want to know a particular numeric
> : > field from
> : > document (id of document's category).
> : > I also need to know how many docs in category were found, so I can't
> : > index
> :
> : You should explore the use of IndexReader. Index your documents with
> : category id field, and use the methods on IndexReader to find all
> : unique categories (TermEnum).
>
> to expand on erik's suggestion: once you know the complete list of
> categories you iterate over them and execute your search once per
> category, filtering each time on the category Id (to determine the number
> of results from that category).

Nah, I did something a little trickier that promises to be faster (I have 12K categories now and there will be more). I index the docs' category ids as zero-padded keywords. Then I search for documents, sorting them by category id. Then I iterate over Hits using the following scheme:

1. I keep a cache that holds the ids of the documents in the current category.
2. Each time I see a doc id that is not in the current category, I read that document and reload the cache with its category's data.

So if I found docs in N categories (N is usually not big), I need to read exactly N docs from disk; the rest of the iteration through Hits is just checking the cache (because I sort by category).

It's a pity Lucene doesn't have IndexSearcher.search( Query, Sort, HitCollector ), but if I understood Hits properly, it gives me an O( log2 ( doc_num ) ) performance impact per result set, which is perfectly acceptable.

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
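The cache-and-reload loop described in this message can be sketched without Lucene at all. The following is only an illustration under stated assumptions: the hits arrive as an array of doc ids already sorted by category, and a hypothetical loader function stands in for "read that document and fetch its category data". The names SortedHitsCategoryCounter, CategoryData and loadCategoryOf are invented for this sketch and are not Lucene classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.IntFunction;

final class SortedHitsCategoryCounter {

    /** Hypothetical cache entry: a category id plus the ids of all docs in that category. */
    static final class CategoryData {
        final String categoryId;
        final Set<Integer> docIds;

        CategoryData(String categoryId, Set<Integer> docIds) {
            this.categoryId = categoryId;
            this.docIds = docIds;
        }
    }

    /** Number of simulated document reads performed by the last call to count(). */
    int documentReads;

    /**
     * Counts hits per category. Because the hits are sorted by category,
     * the loader is invoked once per category run, not once per hit.
     */
    Map<String, Integer> count(int[] hitsSortedByCategory,
                               IntFunction<CategoryData> loadCategoryOf) {
        documentReads = 0;
        Map<String, Integer> counts = new LinkedHashMap<>();
        CategoryData current = null; // the single-category cache
        for (int docId : hitsSortedByCategory) {
            if (current == null || !current.docIds.contains(docId)) {
                // Cache miss: this doc starts a new category, so "read" it.
                current = loadCategoryOf.apply(docId);
                documentReads++;
            }
            counts.merge(current.categoryId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample data: two categories with zero-padded ids.
        CategoryData a = new CategoryData("0001", Set.of(5, 9));
        CategoryData b = new CategoryData("0002", Set.of(2, 7, 8));
        IntFunction<CategoryData> loader = docId -> a.docIds.contains(docId) ? a : b;

        SortedHitsCategoryCounter counter = new SortedHitsCategoryCounter();
        Map<String, Integer> counts = counter.count(new int[] {5, 9, 2, 7}, loader);
        System.out.println(counts + " after " + counter.documentReads + " document reads");
    }
}
```

The sketch relies on the sort order: if the hits were not grouped by category, the cache would be reloaded on nearly every hit and the "exactly N reads" property would be lost.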
Re: Re: Document numbers and ids
> By "renumbered", it means it squeezes out holes left by deletes. The
> actual order does not change and thus does not affect a Sort.INDEXORDER
> sort.
>
> Documents are stored in the index in the order that they were indexed -
> nothing changes this order. Document id's are not permanent if deletes
> occur followed by an optimize.

Thanks for the clarification, Erik. Could you answer one more question: can I control the assignment of document numbers during indexing? It would be very handy for me to have categories of documents aligned on some boundaries, e.g. category N numbers start on N*1. Obviously, there will be some holes in the numbering with this scheme.

Maybe I should explain why I'm asking. I'm searching for documents, but for most (almost all) of them I don't really care about their content. I only want to know a particular numeric field from the document (the id of the document's category). I also need to know how many docs in each category were found, so I can't index categories instead of docs. The result set can be pretty big (30K) and all of it must be handled in an inner loop. So I wanna use HitCollector and assign intervals of ids to categories of documents. This way, there's no need to actually retrieve the document in the inner loop. Am I on the right track?

Mood: wondering why SQL GROUP BY works so fast.
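The interval idea in this message can be illustrated independently of Lucene. The sketch below assumes that the first document number of each category is somehow known and sorted in ascending order; whether Lucene allows arranging that is exactly the open question above. The class IntervalCategoryCounter and its methods are hypothetical names. Its collect(int doc) method only mirrors the shape of a collector callback that receives a document number; it does not extend any Lucene class.

```java
import java.util.Arrays;

final class IntervalCategoryCounter {

    // categoryStarts[c] is the (assumed known) first doc number of category c, ascending.
    private final int[] categoryStarts;
    private final int[] counts;

    IntervalCategoryCounter(int[] categoryStarts) {
        this.categoryStarts = categoryStarts.clone();
        this.counts = new int[categoryStarts.length];
    }

    /** Called once per matching doc number; no stored document is read. */
    void collect(int doc) {
        int pos = Arrays.binarySearch(categoryStarts, doc);
        // Exact match: doc is the first doc of category pos.
        // Otherwise pos == -(insertionPoint) - 1, and the owning
        // category is the one just before the insertion point.
        int category = pos >= 0 ? pos : -pos - 2;
        if (category < 0) {
            throw new IllegalArgumentException("doc " + doc + " precedes the first category");
        }
        counts[category]++;
    }

    int countFor(int category) {
        return counts[category];
    }

    public static void main(String[] args) {
        // Assumed boundaries: category 0 starts at doc 0, category 1 at 100, category 2 at 250.
        IntervalCategoryCounter counter = new IntervalCategoryCounter(new int[] {0, 100, 250});
        for (int doc : new int[] {3, 99, 100, 249, 250, 1000}) {
            counter.collect(doc);
        }
        for (int c = 0; c < 3; c++) {
            System.out.println("category " + c + ": " + counter.countFor(c));
        }
    }
}
```

Each lookup costs a binary search over the category boundaries, so the inner loop stays cheap even for a result set of 30K hits. The mapping breaks as soon as document numbers shift, for example after deletes followed by an optimize, so the boundaries would have to be rebuilt whenever that happens.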
Document numbers and ids
The LiA says that I can use Sort.INDEXORDER when indexing order is relevant, and gives an example where the documents' ids (obtained from Hits.id()) increase from the top to the bottom of the result set. Are those ids the same thing as document numbers? If they are the same, how can they be preserved during the indexing process? LiA says that documents are renumbered when segments are merged.
Re: Re: Search on heterogenous index
>Heterogenous Documents/indices are OK - check out the second hit:
>
> http://www.lucenebook.com/search?query=heterogenous+different

Thanks, I'll consider buying "Lucene in Action".
Search on heterogenous index
Hello all. I'm new to Lucene and am thinking about using it in my project. I have price lists with a dynamic structure, each containing wares: about 10K price lists with 500K wares in total. Each price list has about 5 text fields. I'll do searches on wares. The difficult part is that I'll search across all wares; the search is not bound to a particular price-list structure.

My question is, how should I organize my indices? Can Lucene handle the data effectively if I have one index containing Documents with different Fields? Or should I create a separate index for each price list, with the same Field structure across its Documents?