Re: Document numbers and ids

2005-02-06 Thread Simeon Koptelov
On Sunday 06 February 2005 20:00, Chris Hostetter wrote:
> : > care about their content. I only want to know a particular numeric
> : > field from document (id of document's category).
> : > I also need to know how many docs in category were found, so I can't
> : > index
> :
> : You should explore the use of IndexReader.  Index your documents with
> : category id field, and use the methods on IndexReader to find all
> : unique categories (TermEnum).
>
> to expand on erik's suggestion: once you know the complete list of
> categories you iterate over them and execute your search once per
> category, filtering each time on the category Id (to determine the number
> of results from that category).

Nah, I did something a little trickier that promises to be faster (I have 12K
categories now and there will be more).
I index docs' category ids as zero-padded keywords. Then I search for
documents, sorting them by category id. Then I iterate over Hits following
this scheme:
1. I keep a cache that holds the ids of the documents in the current category.
2. Each time I see a doc id that is not in the current category, I read that
document and reload the cache with its category's data.

So if I find docs in N categories (N is usually not big), I need to read
exactly N docs from disk; the rest of the iteration through Hits is just
checking the cache (because I sort by category).
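
To make the scheme above concrete, here is a minimal sketch in plain Java. It
is not Lucene code and not the code I actually run: the class and method names
(SortedCategoryCount, pad, countRuns) and the padding width of 8 are
assumptions made up for the illustration. It only shows why sorting by a
zero-padded key lets one pass count hits per category, with a single "reload"
point each time the key changes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: one pass over results already sorted by category key.
class SortedCategoryCount {

    // Zero-pad so that lexicographic order matches numeric order.
    // The width of 8 digits is an assumption chosen for this example.
    static String pad(int categoryId) {
        return String.format(Locale.ROOT, "%08d", categoryId);
    }

    // Counts consecutive runs of equal keys. Each key change is the only
    // place where a real implementation would have to load a document.
    static Map<String, Integer> countRuns(List<String> sortedKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String current = null;
        int run = 0;
        for (String key : sortedKeys) {
            if (!key.equals(current)) {
                if (current != null) {
                    counts.put(current, run); // close the previous run
                }
                current = key;
                run = 0;
            }
            run++;
        }
        if (current != null) {
            counts.put(current, run); // close the last run
        }
        return counts;
    }
}
```

Without the padding, "10" would sort before "2" as a string, and the runs
would no longer follow numeric category order.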

It's a pity Lucene doesn't have IndexSearcher.search( Query, Sort,
HitCollector ), but if I understood Hits properly, it gives me an
O( log2( doc_num ) ) performance impact per result set, which is perfectly
acceptable.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re: Document numbers and ids

2005-02-04 Thread Simeon Koptelov
> By "renumbered", it means it squeezes out holes left by deletes.  The 
> actual order does not change and thus does not affect a Sort.INDEXORDER 
> sort.
> 
> Documents are stored in the index in the order that they were indexed - 
> nothing changes this order.  Document id's are not permanent if deletes 
> occur followed by an optimize.
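
The renumbering described in the quote can be modelled in a few lines of plain
Java. This is only a toy model under my own assumptions, not how Lucene
implements it; RenumberSketch and compact are hypothetical names. It shows
that squeezing out deleted slots changes the numbers but keeps the relative
order of the surviving documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of renumbering after deletes; not Lucene code.
class RenumberSketch {

    // deleted[i] is true when old document number i was deleted.
    // Returns the surviving old numbers in index order; a survivor's
    // position in the returned list is its new document number.
    static List<Integer> compact(boolean[] deleted) {
        List<Integer> survivors = new ArrayList<>();
        for (int oldNum = 0; oldNum < deleted.length; oldNum++) {
            if (!deleted[oldNum]) {
                survivors.add(oldNum);
            }
        }
        return survivors;
    }
}
```

With documents 1 and 4 deleted out of five, the survivors 0, 2, 3 become
0, 1, 2: the ids change, the order does not.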

Thanks for the clarification, Erik. Could you answer one more question: can I
control the assignment of document numbers during indexing? It would be very
handy for me to have categories of documents aligned on some boundaries, e.g.
category N numbers start on N*1. Obviously, there will be some holes in
the numbering with this scheme.

Maybe I should explain why I'm asking.
I'm searching for documents, but for most (almost all) of them I don't really
care about their content. I only want to know a particular numeric field from
the document (the id of the document's category).
I also need to know how many docs in each category were found, so I can't
index categories instead of docs.
The result set can be pretty big (30K) and all of it must be handled in an
inner loop. So I want to use HitCollector and assign intervals of ids to
categories of documents. This way, there's no need to actually retrieve the
document in the inner loop.
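
A rough sketch of the arithmetic I have in mind, in plain Java rather than
against the Lucene API. Everything here is an assumption for illustration: the
names IntervalCategoryCount, categoryOf and count are made up, the block size
STRIDE is arbitrary, and the sketch takes for granted that document numbers
could be assigned this way, which is exactly what I am asking about.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the interval idea; not Lucene code.
class IntervalCategoryCount {

    // Assumed block size: category N would own the document numbers
    // from N * STRIDE up to (N + 1) * STRIDE - 1. The value is arbitrary.
    static final int STRIDE = 1000;

    // The category is derived from the document number alone,
    // so no stored field has to be read.
    static int categoryOf(int docNumber) {
        return docNumber / STRIDE;
    }

    // Mimics what a per-hit collect callback would do for each match.
    static Map<Integer, Integer> count(int[] matchingDocNumbers) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int doc : matchingDocNumbers) {
            counts.merge(categoryOf(doc), 1, Integer::sum);
        }
        return counts;
    }
}
```

For matches 5, 17, 1003 and 2999 this yields two hits in category 0 and one
each in categories 1 and 2, without touching any document content.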

Am I on the right track?

Mood: wondering why SQL GROUP BY works so fast.




Document numbers and ids

2005-02-04 Thread Simeon Koptelov
LiA says that I can use Sort.INDEXORDER when indexing order is relevant
and gives an example where documents' ids (obtained from Hits.id()) are
increasing from top to bottom of the result set. Are those ids the same thing
as document numbers?

If they are the same, how can it be that they are preserved during the
indexing process? LiA says that documents are renumbered when merging segments.




Re: Re: Search on heterogenous index

2005-01-25 Thread Simeon Koptelov
>Heterogenous Documents/indices are OK - check out the second hit:
>
>  http://www.lucenebook.com/search?query=heterogenous+different

Thanks, I'll consider buying "Lucene in Action".




Search on heterogenous index

2005-01-21 Thread Simeon Koptelov
Hello all. I'm new to Lucene and am thinking about using it in my project.

I have prices with a dynamic structure, each containing wares: about 10K
prices with 500K wares in total. Each price has about 5 text fields.

I'll do searches on wares. The difficult part is that I'll search across all
wares; the search is not bound to a particular price structure.

My question is, how should I organize my indices? Can Lucene handle data
effectively if I have one index containing different Fields in Documents?
Or should I create a separate index for each price, with the same Fields
structure across Documents?
