Re: What doc id to use on IndexReader with SetNextReader

2011-04-18 Thread Antony Bowesman

Thanks Uwe, I assumed as much.

On 18/04/2011 7:28 PM, Uwe Schindler wrote:

Document d = reader.document(doc)


This is the correct way to do it.

Uwe





What doc id to use on IndexReader with SetNextReader

2011-04-18 Thread Antony Bowesman

I'm migrating some code from 2.3.2 to 2.9.4 and I have custom Collectors.

Now there are multiple calls to collect() and each call needs to adjust the passed doc id by the docBase given in setNextReader().


However, if you want to fetch the document in the collector, what docId/IndexReader combination should be used?


Given that

collect(int doc)
setNextReader(IndexReader reader, int docBase)

I have tested the following two which seem to get the same document

Document d = searcher.getIndexReader().document(doc + docBase)
Document d = reader.document(doc)

Is this guaranteed to always be the case, and is this how the APIs should be used?
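
For illustration, a minimal sketch of a 2.9-style Collector that keeps both pieces of state and shows the two equivalent lookups discussed above (class and field names are invented, not taken from the original code):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Scorer;

public class FetchingCollector extends Collector {
    private final IndexSearcher searcher;
    private IndexReader segmentReader;   // reader passed to setNextReader
    private int docBase;                 // offset of this segment in the top-level reader

    public FetchingCollector(IndexSearcher searcher) {
        this.searcher = searcher;
    }

    public void setScorer(Scorer scorer) {
        // not needed when only fetching documents
    }

    public void setNextReader(IndexReader reader, int docBase) {
        this.segmentReader = reader;
        this.docBase = docBase;
    }

    public void collect(int doc) throws IOException {
        // segment-relative id against the segment reader...
        Document d1 = segmentReader.document(doc);
        // ...or the rebased id against the top-level reader
        Document d2 = searcher.getIndexReader().document(doc + docBase);
    }

    public boolean acceptsDocsOutOfOrder() {
        return true;
    }
}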

Thanks
Antony




NullPointerException in FieldSortedHitQueue

2011-04-14 Thread Antony Bowesman

Upgrading from 2.3.2 to 2.9.4 I get NPE as below

Caused by: java.lang.NullPointerException
    at org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:224)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
    at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:176)
    at org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:56)

I have a SortField that is created something like

new SortField(fieldName, comparator)

which generates a custom SortField, but the call to Comparators.get() fails because createValue in FieldSortedHitQueue calls


case SortField.CUSTOM:
  comparator = factory.newComparator (reader, fieldname);

and factory is null.

Is this a bug?  I know FSHQ is deprecated, but presumably it should still work 
with a SortField containing a comparator?


Thanks
Antony




Index time boost question

2011-04-14 Thread Antony Bowesman
I have a test case written for 2.3.2 that tested an index time boost on a field 
of 0.0F and then did a search using Hits and got 0 results.


I'm now in the process of upgrading to 2.9.4 and am removing all use of Hits in 
my test cases and using a Collector instead.  Now the test case fails as it gets 
one result with a score of 0.0.


Is it correct that the Collector will collect hits with a 0 score?  If so, I need to change my test case.
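
If the test should keep the old behaviour, one option is to drop zero scores in the collector itself; below is a minimal sketch (class name invented, 2.9 Collector API assumed). If I remember correctly, 2.9 also ships a PositiveScoresOnlyCollector wrapper that does much the same thing.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class PositiveScoreCollector extends Collector {
    private Scorer scorer;
    private int docBase;
    private final List<Integer> docs = new ArrayList<Integer>();

    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
    }

    public void collect(int doc) throws IOException {
        if (scorer.score() > 0.0f) {      // skip hits scored 0 by the 0.0 boost
            docs.add(docBase + doc);
        }
    }

    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public List<Integer> getDocs() {
        return docs;
    }
}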


Thanks




DocIdSet to represent small number of hits in large Document set

2011-04-04 Thread Antony Bowesman

I'm converting from Lucene 2.3.2 to 2.4.1 (with a view to going to 2.9.4).

Many of our indexes are 5M+ Documents; however, only a small subset of these is relevant to any user.  As a DocIdSet backed by a BitSet or OpenBitSet is rather inefficient in terms of memory use, what is the recommended DocIdSet implementation to use in this scenario?


Seems like SortedVIntList can be used to store the info, but it has no methods 
to build the list in the first place, requiring an array or bitset in the 
constructor.
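
One way around that is to gather the doc ids yourself in increasing order and hand them to the (int[], size) constructor; a rough sketch of that idea, not tested against any particular release:

import org.apache.lucene.util.SortedVIntList;

public class SparseDocIdSetBuilder {
    private int[] docs = new int[16];
    private int size = 0;

    // Doc ids must be added in increasing order (e.g. from a collector
    // that rebases segment-local ids by docBase).
    public void add(int docId) {
        if (size == docs.length) {
            int[] grown = new int[docs.length * 2];
            System.arraycopy(docs, 0, grown, 0, size);
            docs = grown;
        }
        docs[size++] = docId;
    }

    public SortedVIntList build() {
        return new SortedVIntList(docs, size);
    }
}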


I had used Nutch's DocSet and HashDocSet implementations in my 2.3.2 deployment, 
but want to move away from that Nutch dependency, so wondered if Lucene had a 
way to do this?


Thanks




TopFieldDocCollector and v3.0.0

2009-12-07 Thread Antony Bowesman

I'm on 2.3.2 and looking to move to 2.9.1 or 3.0.0

In 2.9.1 TopFieldDocCollector is

"Deprecated. Please use TopFieldCollector instead."

in 3.0.0 TopFieldCollector says

NOTE: This API is experimental and might change in incompatible ways in the next 
release


What is the suggested path for migrating TopFieldDocCollector usage?

Antony






NumberFormatException when creating field cache

2009-09-09 Thread Antony Bowesman
I'm using Lucene 2.3.2 and have a date field used for sorting, which is 
MMDDHHMM.  I get an exception when the FieldCache is being generated as follows:


java.lang.NumberFormatException: For input string: "190400-412317"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:412)
    at java.lang.Long.parseLong(Long.java:461)
    at org.apache.lucene.search.ExtendedFieldCacheImpl$1.parseLong(ExtendedFieldCacheImpl.java:18)
    at org.apache.lucene.search.ExtendedFieldCacheImpl$3.createValue(ExtendedFieldCacheImpl.java:53)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
    at org.apache.lucene.search.ExtendedFieldCacheImpl.getLongs(ExtendedFieldCacheImpl.java:36)
    at org.apache.lucene.search.ExtendedFieldCacheImpl.getLongs(ExtendedFieldCacheImpl.java:30)
    at org.apache.lucene.search.FieldSortedHitQueue.comparatorLong(FieldSortedHitQueue.java:254)
    at org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:194)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
    at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:168)
    at org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:56)

I'm not able to get onto the server that holds the index at the moment, but I expect my data is corrupt in the index.  That may be because I have not validated certain data given by a 'trusted' source.  However, the problem now is that, with the data as it stands, I am unable ever to sort on the date field.


It may be that the original data for the Document is no longer available, so deleting and re-creating may not be an option.


Would it be useful to allow some sort of data tolerance when creating these caches?  At present the only solution is to delete that Document.  Perhaps the values could be returned as 0 by the Parser implementations on numeric failures.
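
As a sketch of that 'tolerant parser' idea: in the 2.9.x API (not in 2.3.2) a custom FieldCache.LongParser can be handed to SortField, so bad values could be mapped to 0 instead of aborting the cache load. This is only an illustration of the suggestion, not something the 2.3.2 code path supports:

import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.SortField;

public final class LenientLongSort {
    public static SortField lenientLongSort(String field) {
        return new SortField(field, new FieldCache.LongParser() {
            public long parseLong(String value) {
                try {
                    return Long.parseLong(value);
                } catch (NumberFormatException e) {
                    return 0L;   // treat corrupt values as 0 rather than failing the sort
                }
            }
        });
    }
}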


Antony






Re: TermEnum with deleted documents

2009-05-10 Thread Antony Bowesman

Hi Mike,

Thanks for the response.

I looked at that issue, but my case is trivial to fix.  I just keep the Set of terms I have deleted and ignore those during my second iteration.
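
An alternative sketch of the same idea that avoids tracking a Set: since TermDocs does skip deleted docs, a term can be ignored whenever its TermDocs has no remaining documents (helper name invented):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public final class TermLiveness {
    // True if at least one non-deleted document still carries this term.
    public static boolean hasLiveDocs(IndexReader reader, Term term) throws IOException {
        TermDocs td = reader.termDocs(term);
        try {
            return td.next();   // deleted docs are not returned by TermDocs
        } finally {
            td.close();
        }
    }
}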


Thanks
Antony



Michael McCandless wrote:

This is known & expected.

Lucene does not update the terms dictionary (meaning which terms are
in the index, and their frequency) in response to deleted docs.

It does update TermDocs enumeration, ie once you get the TermDocs for
a given term and step through its docs, the deleted docs will not be
returned.

One workaround is to call IndexWriter.expungeDeletes, but that's a
costly operation (forces merges of any segments containing deletes).

https://issues.apache.org/jira/browse/LUCENE-1613 was opened to gather
use cases / issues on this... if this is impacting your application,
can you post some details to that issue?

Mike

On Thu, May 7, 2009 at 1:04 AM, Antony Bowesman  wrote:

I am merging Index A to Index B.  First I read the terms for a particular
field from index A and some of the documents in A get deleted.

I then enumerate the terms on a different field also in index A, but the
terms from the deleted document are still present.

The termEnum.docFreq() also returns > 0 for those terms even though the docs
are deleted.

Should this be the case?  I have tried closing the reader between
enumerations, but no difference.

Antony







TermEnum with deleted documents

2009-05-06 Thread Antony Bowesman
I am merging Index A to Index B.  First I read the terms for a particular field 
from index A and some of the documents in A get deleted.


I then enumerate the terms on a different field also in index A, but the terms 
from the deleted document are still present.


The termEnum.docFreq() also returns > 0 for those terms even though the docs are 
deleted.


Should this be the case?  I have tried closing the reader between enumerations, 
but no difference.


Antony







Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman

Thanks for that info.  These indexes will be large, in the 10s of millions.  The id field is unique and is 29 bytes.  I guess that's still a lot of data to trawl through to get to the term.


Have you tested how long it takes to look up docs from your id?


Not in indexes that size in a live environment, as I don't have the hardware to make those sorts of tests :( although I know that, in general, lookup is fast.



Couldn't you just give the base & full docs different ids?  Then you
can independently choose which one to update?


I considered that, but the normal case does not need to worry about this scenario.


There is only ever one instance of a mail Doc, whether it is a root mail or part of a forward chain, and a root mail can of course be part of a forward chain at some point.  So it should be optimal to just fetch the one Document for the mail Id, without first trying the true Id and then some pseudo Id if it isn't found.


Unfortunately, I'm having to solve this problem in my Lucene app as the tool 
that's generating this data is unable to know what has or has not been handled 
previously.


I'm implementing it using the IndexReader approach for now and will try to get 
some performance data, so thanks for your comments Mike.


Antony











Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman

Michael McCandless wrote:

Lucene doesn't provide any way to do this, except opening a reader.

Opening a reader is not "that" expensive if you use it for this
purpose.  EG neither norms nor FieldCache will be loaded if you just
enumerate the term docs.


Thanks for that info.  These indexes will be large, in the 10s of millions.  The id field is unique and is 29 bytes.  I guess that's still a lot of data to trawl through to get to the term.



But, you can let Lucene do the same thing for you by just always using
updateDocument, which'll remove the old doc if it's present.


That's precisely what I don't want to occur.  I have two forms of a Document, 
which represent mail items.  One 'full' version containing all index and stored 
data, which represents a searchable mail item and one 'base', which is simply a 
marker Document which represents a mail in a forwarded mail chain, with just a 
couple of stored fields containing the mail meta data.


Under normal circumstances there are no problems as mails arrive in sequence and 
are never handled twice, but there is one case, during a reindex op, when the 
arrival of those mails can come out of sequence, i.e. a full mail is indexed 
first, but that mail is later processed as part of a forwarded mail chain of 
another mail.


It is the second time that mail is handled as a base mail that I do not want it 
to overwrite the full version.


Would it be technically difficult to support something like this in the IndexWriter API, and if not, would it end up being more efficient than using a reader/terms to check this?


Antony








Which is more efficient

2009-05-05 Thread Antony Bowesman

Just wondered which was more efficient under the hood.  Given

    for (int i = 0; i < size; i++)
        terms[i] = new Term("id", doc_key[i]);

this

    writer.deleteDocuments(terms);
    for (int i = 0; i < size; i++)
        writer.addDocument(doc[i]);

or this

    for (int i = 0; i < size; i++)
        writer.updateDocument(terms[i], doc[i]);





How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
I'm adding Documents in batches to an index with IndexWriter.  In certain 
circumstances, I do not want to add the Document if it already exists, where 
existence is determined by field id=myId.


Is there any way to do this with IndexWriter or do I have to open a reader and 
look for the term id:XXX?  Given that opening a reader is expensive, is there 
any way to do this efficiently?


I guess what I want is

IndexWriter.addDocumentIfMissing(Term term, Document doc, Analyzer analyzer)
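
As a rough sketch of how that could be approximated today, assuming a reasonably fresh IndexReader and a unique id term (helper and variable names invented):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public final class IndexHelper {
    public static void addDocumentIfMissing(IndexWriter writer, IndexReader reader,
                                            Term idTerm, Document doc, Analyzer analyzer)
            throws IOException {
        TermDocs td = reader.termDocs(idTerm);
        try {
            if (!td.next()) {               // no live document carries this id
                writer.addDocument(doc, analyzer);
            }
        } finally {
            td.close();
        }
    }
}

The reader would, of course, need to be reopened often enough to see documents added earlier in the same batch.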

Thanks
Antony







Re: Lucene 2.4 - Searching

2009-01-27 Thread Antony Bowesman

Karl Heinz Marbaise wrote:


I have a field which is called filename and contains a filename which can of course be lowercase or uppercase or a mixture...


I would like to do the following:

+filename:/*scm*.doc

That should result in getting things like

/...SCMtest.doc
/...scmtest.doc
/...scm.doc
etc.

May be someone can give me hint how to solve this...


It's all down to the analyzer you use when you index that field and how you 
choose to tokenize it.  If you want to always search case insensitively, then 
you should lower case the filename when indexing.


Depending on how you implemented your query parser: if you have implemented wildcard query support and it behaves anything like the standard QP, it will not put the query string through the analyzer, so a search for


+filename:/*SCm*.doc

would then not find anything, so you'd need to make sure you lower case all the 
filename field searches at some point.


I use a custom analyzer for filenames, which lower cases and tokenizes by letter 
or digit or any custom chars and my query parser supports custom analyzers for 
getFieldQuery().
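
For illustration only (this is not the analyzer described above), a minimal sketch of that idea against the 2.x analysis API, lower-casing and splitting on anything that is not a letter or digit:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class FilenameAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return Character.isLetterOrDigit(c);   // split on '/', '.', '-', etc.
            }

            protected char normalize(char c) {
                return Character.toLowerCase(c);       // index case-insensitively
            }
        };
    }
}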


If you want to keep the original filename, then just store the field as well as 
index it, then you can get the original back from the Document.


Antony





Re: addIndexesNoOptimize question

2008-12-19 Thread Antony Bowesman

Thanks Mike, I'm still on 2.3.1, so will upgrade soon.
Antony


Michael McCandless wrote:
This was an attempt on addIndexesNoOptimize's part to "respect" the 
maxMergeDocs (which prevents large segments from being merged) you had 
set on IndexWriter.


However, the check was too pedantic, and was removed as of 2.4, under 
this issue:


https://issues.apache.org/jira/browse/LUCENE-1254

Mike







addIndexesNoOptimize question

2008-12-17 Thread Antony Bowesman

The javadocs state

"This requires ... and the upper bound* of those segment doc counts not exceed 
maxMergeDocs."


Can one of the gurus please explain what that means and what needs to be done to find out whether an index being merged meets that criterion?


Thanks
Antony







Re: Which is faster/better

2008-11-25 Thread Antony Bowesman

Michael McCandless wrote:


If you have nothing open already, and all you want to do is delete
certain documents and make a commit point, then using IndexReader vs
IndexWriter should show very little difference in speed.


Thanks.  This use case can assume there may be nothing open.  I prefer IndexWriter, as delete=write is a much clearer concept than delete=read...



As of 2.4, IndexWriter now provides delete-by-Query, which I think
ought to meet nearly all of the cases where someone wants to
delete-by-docID using IndexReader.


Yes, that is an excellent addition.  Up to now, our only use case for delete-by-docId has been to perform a delete-by-query, and so far we have been using your suggestion from last year about how to delete documents for ALL terms.


Antony






Which is faster/better

2008-11-24 Thread Antony Bowesman
In 2.4, as well as IndexWriter.deleteDocuments(Term) there is also 
IndexReader.deleteDocuments(Term).


I understand opening a reader is expensive, so does this mean using IndexWriter.deleteDocuments would be faster from a closed index position?


As the IndexReader instance is newer, it has better Javadocs, so it's unclear 
which is the 'right' one to use.


Any pointers?
Antony





Re: distinct field values

2008-10-14 Thread Antony Bowesman

Akanksha Baid wrote:
I have indexed multiple documents - each of them have 3 fields ( id, tag 
, text). Is there an easy way to determine the set of tags for a given 
query without iterating through all the hits?
For example if I have 100 documents in my index and my set of tag = {A, 
B, C}. Query Q on the text field returns 15 docs with tag A , 10 with 
tag B and none with tag C (total of 25 hits). Is there a way to 
determine that the set of tags for query Q = {A, B} without iterating 
through all 25 hits.


Another way is to use a HitCollector to collect all the hits into a Map and then 
use TermEnum + TermDocs to walk the tags / docs and see what tag the hit comes 
from.  This would be different to walking the Hits/Documents to fetch the tag 
from the Document.  Not sure if this is very efficient though, depends on the 
Document count.


Antony








Re: Phrase Query

2008-09-16 Thread Antony Bowesman

Is it possible to write a document with different analyzers in different fields?


PerFieldAnalyzerWrapper
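
A minimal usage sketch (the field names here are invented): wrap a default analyzer and override it per field.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public final class Analyzers {
    public static Analyzer perField() {
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());   // default for most fields
        wrapper.addAnalyzer("id", new KeywordAnalyzer());              // ids stay single tokens
        return wrapper;
    }
}

The same wrapper instance is then passed to the IndexWriter and to query parsing so indexing and searching stay consistent.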







Caching Filters and docIds when using MultiSearcher/IndexSearcher(MultiReader)...

2008-09-11 Thread Antony Bowesman
Up to now I have only needed to search a single index, but now I will have many 
index shards to search across.  My existing search maintained cached filters for
the index as well as a cache of my own unique ID fields in the index, keyed by 
Lucene DocId.


Now I need to search multiple indices, I am trying to work out how to continue 
to use these caches.


I have one index per month of data (up to 10M docs per month) and users can 
search across whichever date range they want, so one search may search Index 
1-->12 (e.g. Jan07-Dec07) and another 13-20 (Jan08-Aug08).


It makes no sense to cache a single bitset generated from a MultiReader over 
indices 1-12 when the next search could be for indices 2-11 and all the bits 
would be useless, so to be of any use, caches, including cached BitSets should 
therefore contain the doc ids specific to the particular index rather than to 
any particular MultiReader.  Then my Filter implementation can determine the 
real doc id and delegate to a bitset for the particular reader instance.
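
That per-reader delegation is essentially what a filter cache keyed by IndexReader gives you; a sketch against the 2.3-era Filter.bits(IndexReader) API (class name invented, and much like what CachingWrapperFilter already does):

import java.io.IOException;
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

public class PerReaderCachingFilter extends Filter {
    private final Filter wrapped;
    private final Map<IndexReader, BitSet> cache = new WeakHashMap<IndexReader, BitSet>();

    public PerReaderCachingFilter(Filter wrapped) {
        this.wrapped = wrapped;
    }

    public synchronized BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = cache.get(reader);
        if (bits == null) {
            bits = wrapped.bits(reader);   // doc ids here are local to this reader
            cache.put(reader, bits);
        }
        return bits;
    }
}

Because a MultiSearcher hands each sub-searcher its own reader, the cached bits stay local to each index shard regardless of which combination of shards a search spans.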


This means I need to find the original reader/searcher instance and the 
particular doc Id from that instance to perform bitset checks or cache lookups.


In the MultiSearcher there is subDoc and subSearcher, but there's no such beast 
for an IndexReader to find the real reader/doc from the pseudo one.


This also raises the question about MultiSearcher vs IndexSearcher(MultiReader) 
which, even after reading the archives, I am unsure which I should use -
there seem to be comments in the dev list to avoid MultiSearcher...


Any thoughts or have I spiralled too far into Lucene's depths to see where I 
am...?

Antony









Re: Merging indexes - which is best option?

2008-09-08 Thread Antony Bowesman

Thanks Karsten,


I decided first to delete all duplicates from master(iW) and then to insert
all temporary indices(other).


I reached the same conclusion.  As your code shows, it's a simple enough 
solution.  You had a good point with the iW.abort() in the rollback case.


Antony







Merging indexes - which is best option?

2008-09-04 Thread Antony Bowesman
I am creating several temporary batches of indexes in separate indices and will periodically merge those batches into a set of master indices.  I'm using IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the master may already contain the index entry for that document, so I get a duplicate.


Duplicates are prevented in the temporary index, because when adding Documents, 
I call IndexWriter#deleteDocuments(Term) with my UID, before I add the Document.


I have two choices

a) merge indexes then clean up any duplicates in the master (or vice versa). 
Probably IndexWriter.deleteDocuments(Term[]) would suit here with all the UIDs 
of the incoming documents.


b) iterate through the Documents in the temporary index and add them to the 
master

(b) sounds worse, as it seems an IndexWriter's Analyzer cannot be null, and I guess there's a penalty in assembling the Document from the reader.
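
For option (a), a rough sketch assuming a unique "uid" field (field, class and method names invented):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public final class MergeWithDedup {
    public static void merge(IndexWriter master, Directory temporary, String[] incomingUids)
            throws IOException {
        Term[] uidTerms = new Term[incomingUids.length];
        for (int i = 0; i < incomingUids.length; i++) {
            uidTerms[i] = new Term("uid", incomingUids[i]);
        }
        master.deleteDocuments(uidTerms);                           // drop duplicates first
        master.addIndexesNoOptimize(new Directory[] { temporary }); // then merge the batch
    }
}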


Any views?
Antony










Javadoc wording in IndexWriter.addIndexesNoOptimize()

2008-09-04 Thread Antony Bowesman

The Javadoc for this method has the following comment:

"This requires this index not be among those to be added, and the upper bound* 
of those segment doc counts not exceed maxMergeDocs. "


What does the second part of that mean?  It is especially confusing given that MAX_MERGE_DOCS is deprecated.


Thanks
Antony






Re: Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman

Michael McCandless wrote:


Ahh right, my short term memory failed me ;)  I now remember this thread.


Excused :) I expect you have real work to occupy your mind!


Yes, though LUCENE-1231 (column stride stored fields) should help this.


I see from JIRA that MB has started working on this - It's marked as 3.0, but 
there was some hope for a 2.4 release.  Are there any estimates for when this 
might get to a release - this is an exciting development for me.


Thanks
Antony






Re: Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman

Michael McCandless wrote:


TermDocs.skipTo() only moves forwards.

Can you use a stored field to retrieve this information, or do you 
really need to store it per-term-occurrence in your docs?


I discussed my use case with Doron earlier and there were two options, either to 
use payloads or stored fields.  With the payload case, for a single field 
(owner) in a document there are multiple unique terms (ownerId), each with a 
payload (access Id).


Using stored fields I have to store something like

ownerId:accessId
ownerId:accessId
ownerId:accessId

then fetch the stored field for the document and then find the particular 
accessId for the owner I am searching for.


I was testing the performance implications of each as I understand fetching 
stored fields is not optimal and the payload scenario is logically a better fit, 
as every owner will have a different accessId for every Document.


What would fit my usage would be something like

byte[] b = doc.getPayload("owner", ownerId);

where for the given OID, I can retrieve the payload I associated with it, when 
I did

doc.add(new Field("owner", ownerId, accessPayload));

but that's not how it works :(

Antony






Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman
I have a custom TopDocsCollector and need to collect a payload from each final 
document hit.  The payload comes from a single term in each hit.


When collecting the payload, I don't want to fetch the payload during the 
collect() method as it will make fetches which may subsequently be bumped from 
the topDocs, so I want to fetch it during the topDocs() call.


I made some performance tests on a simple index of 5M documents.  If I do

reader.termPositions(term);
termPositions.skipTo(scoreDoc.doc);

it takes up to 282 ms just to make the skipTo.

The javadocs imply that skipTo() can only go forwards and as scoreDocs is in 
score order, not docId order, I suppose it's not possible to just use


termPositions.skipTo(scoreDoc.doc);

unless skipTo() can also go backwards.  Can it?  The Javadocs imply there is more than one type of implementation.


If not, I suppose I must re-sort the scoreDocs into docId order and then loop with termPositions.skipTo(scoreDoc.doc).  The number of hits will typically be small, so it'll be fast enough.
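
A sketch of that forward-only approach - sort the collected ScoreDocs by doc id, then walk a single TermPositions with skipTo() (helper name invented, payload API as in 2.4+):

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.search.ScoreDoc;

public final class PayloadFetcher {
    public static void fetchPayloads(IndexReader reader, Term term, ScoreDoc[] hits)
            throws IOException {
        ScoreDoc[] byDocId = hits.clone();
        Arrays.sort(byDocId, new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return a.doc - b.doc;                    // increasing doc id order
            }
        });
        TermPositions tp = reader.termPositions(term);
        try {
            for (ScoreDoc sd : byDocId) {
                if (tp.skipTo(sd.doc) && tp.doc() == sd.doc) {   // forward skips only
                    tp.nextPosition();
                    if (tp.isPayloadAvailable()) {
                        byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                        // use payload for this hit...
                    }
                }
            }
        } finally {
            tp.close();
        }
    }
}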


Antony









Re: Multiple index performance

2008-08-18 Thread Antony Bowesman

[EMAIL PROTECTED] wrote:

Thanks Anthony for your response, I did not know about that field.


You make your own fields in Lucene, it is not something Lucene gives you.



But still I have a problem and it is about privacy. The users are concerned
about privacy and so, we thought we could have all their files in a folder
and encrypt the whole folder and index with a user key, so then when user
logs in, decrypt the folder with the key and so Lucene can reach the
documents, so that is why I am concerned about efficiency, since I do not
know if Lucene could handle the 10,000 indexes.



It seems like you may be confusing what Lucene will give you.  The original file 
content and the Lucene indexes are two different things.  It sounds like you 
want to protect access to the original content on some shared storage, but that 
is not related to the searching provided by your Lucene app, or maybe I 
misunderstood your use case.


Antony






Re: Multiple index performance

2008-08-18 Thread Antony Bowesman

Cyndy wrote:


I want to keep user text files indexed separately, I will have about 10,000
users and each user may have about 20,000 short files, and I need to keep
privacy. So the idea is to have one folder with the text files and  index
for each user, so when search will be done, it will be pointing to the
corresponding file directory. Would this approach hit performance? is this a
good solution? Any recommendation?


For access control, we use an ownerId field in Lucene which indexes the owning 
user.  We filter all searches using ownerId.  This allows all Documents to be 
kept in a single index.
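
For illustration, that ownerId filtering can be as simple as a cached filter per owner ("ownerId" is our own field name, not a Lucene built-in); a sketch:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public final class OwnerFilters {
    public static Filter forOwner(String ownerId) {
        // cache the bits so repeat searches by the same owner don't re-run the term query
        return new CachingWrapperFilter(
                new QueryWrapperFilter(new TermQuery(new Term("ownerId", ownerId))));
    }
}

The filter is then passed alongside the user's query, so other users' documents neither match nor influence scoring.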


We also support sharding across multiple index files for performance/scaling 
considerations, via a hash of the ownerId, but in practice have not needed it. 
Much will depend on your search usage.


YMMV
Antony






Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-18 Thread Antony Bowesman

Doron Cohen wrote:

The API definitely doesn't promise this.
AFAIK implementation wise it happens to be like this but I can be wrong and
plus it might change in the future. It would make me nervous to rely on
this.



I made some tests and it 'seems' to work, but I agree, it also makes me nervous 
to rely on empirical evidence for the design rather than a clearly documented API!




Anyhow, for your need I can think of two options:

Option 1:  just index the ownerID, do not store it, do not index or store
accessID (unless you wish to search by it, in this case just index it). In
addition store a dedicated mapping field that maps from ownerID to accessID.
E.g. with serialization of HashMap or something thinner. At runtime retrieve
this map from the document and it has all that information.



Hey that's an interesting idea!  I'd not considered storing the mapping, only 
re-creating it from fields at runtime.  I'll explore this.
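
A rough sketch of that stored-mapping idea, serialising the ownerId -> accessId map into a binary stored field (the field name "accessMap" is invented, and plain Java serialisation is used only for illustration):

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public final class AccessMapField {
    public static void addTo(Document doc, Map<String, String> ownerToAccess) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(new HashMap<String, String>(ownerToAccess));   // serialize the mapping
        out.close();
        doc.add(new Field("accessMap", bytes.toByteArray(), Field.Store.YES));
    }

    @SuppressWarnings("unchecked")
    public static Map<String, String> readFrom(Document doc) throws IOException, ClassNotFoundException {
        byte[] raw = doc.getBinaryValue("accessMap");
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw));
        return (Map<String, String>) in.readObject();                  // rebuild at search time
    }
}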




Option 2: as you describe above, just index the ownerID with accessID as
payload, and then for the hitting docid of interest use termPositions to get
the payload, i.e. something like:
TermPositions tp = reader.termPositions();
tp.seek(new Term("ownerID",oid));
tp.skipTo(docid);
tp.nextPosition();
if (tp.isPayloadAvailable()) {
  byte [] accessIDBytes = tp.getPayload(...);
  ...


Yes, I was playing with this technique yesterday.  It's not easy to determine 
the performance implications of this method.  I will be using caches, but my 
volumes are potentially so large that I may never be able to cache everything 
(perhaps 500M Docs), so this has to be very quick.


I'll play with both approaches and see which works best.

Thanks for your time, and I appreciate your valuable insight, Doron.
Antony






Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-17 Thread Antony Bowesman

I assume you already know this but just to make sure what I meant was clear
- no tokenization but still indexing just means that the entire field's text
becomes a single unchanged token. I believe this is exactly what
SingleTokenTokenStream can buy you - a single token, for which you can pre
set a payload.


Yes, I was with you :)



It is.  Field maintains its  value and it is either string/stream/etc. Once
you set it to tokenStream the string value is lost and there's no way to
store it.


Thanks for that - I delved a little further into FieldsWriter and see what you 
mean.




How about adding this field in two parts, one part for indexing with the
payload and the other part for storing, i.e. something like this:

Token token = new Token(...);
token.setPayload(...);
SingleTokenTokenStream ts = new SingleTokenTokenStream(token);

Field f1 = new Field("f","some-stored-content",Store.YES,Index.NO);
Field f2 = new Field("f", ts);


Now that got me thinking and I have exposed a rather large misconception in my 
understanding of the Lucene internals when consider fields of the same name.


Your idea above looked like a good one.  However, I realise I am probably trying 
to use payloads wrongly.  I have the following information to store for a single 
Document


contentId - 1 instance
ownerId 1..n instances
accessId 1..n instances

One ownerId has a corresponding accessId for the contentId.

My search criteria are ownerId:XXX + user criteria.  When there is a hit, I need 
the contentId and the corresponding accessId (for the owner) back.  So, I wanted 
to store the accessId as a payload to the ownerId.


This is where I came unstuck.  For 'n=3' above, I used the 
SingleTokenTokenStream as you suggested with the accessId as the payload for 
ownerId.  However, at the Document level, I cannot get the payloads from the 
field so, in trying to understand fields with the same name, I discovered that 
there is a big difference between


(a)
doc.add(new Field("ownerId", "OID1", Store.YES, Index.NO_NORMS));
doc.add(new Field("ownerId", "OID2", Store.YES, Index.NO_NORMS));
doc.add(new Field("ownerId", "OID3", Store.YES, Index.NO_NORMS));

and (b)
doc.add(new Field("ownerId", "OID1 OID2 OID3", Store.YES, Index.NO_NORMS));

as Document.getFields("ownerId") for (a) will return 3 fields and for (b) it will return 1.

My question then is, if I do

for (int i = 0; i < owners; i++)
{
    f = new Field("ownerId", oid[i], Store.YES, Index.NO_NORMS);
    doc.add(f);
    f = new Field("accessId", aid[i], Store.YES, Index.NO_NORMS);
    doc.add(f);
}

then will the array elements for the corresponding Field arrays returned by

Document.getFields("ownerId")
Document.getFields("accessId")

**guarantee** that the array element order is the same as the order they were 
added?

Antony






Re: Payloads and tokenizers

2008-08-14 Thread Antony Bowesman
Thanks for your comments Doron.  I found the earlier discussions on the dev list 
(21/12/06), where this issue is discussed - my use case is similar to Nadav Har'El.


Implementing payloads via Tokens explicitly prevents the use of payloads for 
untokenized fields, as they only support field.stringValue().  There seems no 
way to override this.


My field is currently stored, so the tokenStream approach you suggested, 
(Lucene-580) will not work as it's theoretically only for non-stored fields.  In 
practice, I expect I can create a stored/indexed Field with a dummy string 
value, then use setValue(TokenStream).  At least I can have stored fields with 
Payloads using the analyzer/tokenStream route.  Is this illegal?


What if the Fieldable had a tokenValue(), in addition to the existing 
stream/string/binary/reader values, which could be used for untokenized fields 
and used in invertField()?


I'd rather stick with core Lucene than start making proprietary changes, but it 
seems I can't quite get to where I want to be without some quite cludgy code for 
a very simple use case :(


Antony



Doron Cohen wrote:

IIRC first versions of patches that added payloads support had this notion
of payload by field rather than by token, but later it was modified to be by
token only.

I have seen two code patterns to add payloads to tokens.

The first one created the field text with a reserved separator/delimiter
which was later identified by the analyzer who separated the payload part
from the token part, created the token and set the payload.

The other pattern was to create a field with a TokenStream. Can be done only
for non storable fields. Here you can create the token in advance, and you
have a SingleTokenStream (I think this is how it is called) to wrap it in
case it is a single token. Since the token is created in advance, there's no
analysis going on, and you can set the payload of that token on the spot.  I prefer this pattern - more efficient and elegant.

Doron






Payloads and tokenizers

2008-08-13 Thread Antony Bowesman
I started playing with payloads and have been trying to work out how to get the 
data into the payload


I have a field where I want to add the following untokenized fields

A1
A2
A3

With these fields, I would like to add the payloads

B1
B2
B3

Firstly, it looks like you cannot add payloads to untokenized fields.  Is this 
correct?  In my usage, A and B are simply external Ids so must not be tokenized 
and there is always a 1-->1 relationship between them.


Secondly, what is the way to provide the payload data to the tokenizer?  It looks like I have to add a List/Map of payload data to a custom Tokenizer and Analyzer, which is then consumed on each "next(Token)" call.  However, it would be nice if, in my use case, I could use some kind of construct like:


Document doc = new Document();
Field f = new Field("myField", "A1", Field.Store.NO, Field.Index.UNTOKENIZED);
f.setPayload("B1");
doc.add(f);

and avoid the whole unnecessary Tokenizer/Analyzer overhead and give support for 
payloads in untokenized fields.


It looks like it would be trivial to implement in DocumentsWriter.invertField(). 
 Or would this corrupt the Fieldable interface in an undesirable way?


Antony







Re: Per user data store

2008-08-05 Thread Antony Bowesman

Ganesh - yahoo wrote:

Hello all,

Documents coressponding to multiple users are to be indexed. Each user is
going to search only his documents. Only Administrator could search all users
data.

Is it good to have one database for each User or to have only one database
for all Users? Which will be better?


I created a hybrid approach that supported 1..n databases based on a hash of the 
user's user Id.  This was to allow for the situation where a single database 
would not scale - at the time there was not good information about Lucene's 
performance with large data sets.


In practice, we are now using a single database with data for all users.  There 
is an 'ownerId' field with the unique user Id in every document.


> My opinion is to have one database for all users and to have field
> 'Username'. Using this field data will get filtered out and the search
> results will be served to the User. In this approach, whether Username should
> be part of boolean query or TermFilter will be the better approach?

The ownerId is used as a cached filter rather than always added to the query, so 
that only that user's documents influence the score.  If it is part of the 
query, the complete document set for other users will influence the hits for 
this user.


Antony






Re: Modifying a document by updating a payloads?

2008-07-30 Thread Antony Bowesman

Hi Mike,

Unfortunately you will have to delete the old doc, then reindex a new 
doc, in order to change any payloads in the document's Tokens.


This issue:

https://issues.apache.org/jira/browse/LUCENE-1231

which is still in progress, could make updating stored (but not indexed) 
fields a much lower cost operation, but that's not for sure and it's not 
clear when that issue will be done.


Michael Busch's Apache Con (2006/7??) presentation summarized with the bullet

"Per-document Payloads – updateable"

Is making a document 'updatable' (in _some_ way) something still seen as a long 
term goal for Lucene?


As far as implementation is concerned, if a stored (not indexed) field may be updatable with 1231, is there some difficulty with making payloads, which from my understanding are attributed to a posting of an indexed field, updatable?  I guess they ultimately equate to the same thing - i.e. using a stored field to hold the document's "payload" - but it would be an extra field to load.


Antony








Modifying a document by updating a payloads?

2008-07-30 Thread Antony Bowesman

I seem to recall some discussion about updating a payload, but I can't find it.

I was wondering if it were possible to use a payload to implement 'modify' of a Lucene document.  I have an ID field, which holds a unique ID referring to an external DB.  For example, I would like to store a short bitmap giving state information about aspects of the Document; this state could change during the life of the Document and should be available to my searchers.


I've not yet played with payloads and I understand there is something in the 
pipeline about updating Documents, but is it possible to update a payload for an 
existing Document?


Antony






Re: Rebuilding parallel indexes

2008-06-09 Thread Antony Bowesman

Andrzej Bialecki wrote:


I have a thought ;) Perhaps you could use a FilteredIndexReader to maintain a
map between new IDs and old IDs, and remap on the fly. Although I think that
some parts of Lucene depend on the fact that in a normal index the IDs are
monotonically increasing ... this would complicate the issue.


Interesting thought!  I've not yet looked into the guts of the ParallelReader; I can imagine that it could work, but it sounds like an effective rewrite of ParallelReader.  Optimize would be a problem, though, as optimizing this index would then mean the mapping table would need recreating (I'm assuming the optimization would muck up the Ids if only the parallel index was optimized).


You'd also need to get the new doc Id for each doc that is added.  Are docIds 
allocated during addDocument or during the commit?


Antony






Rebuilding parallel indexes

2008-06-09 Thread Antony Bowesman
I have a design where I will be using multiple index shards to hold approx 7.5 
million documents per index per month over many years.  These will be large 
static R/O indexes but the corresponding smaller parallel index will get many 
frequent changes.


I understand from previous replies by Hoss that the technique to handle this is 
to use parallel indexes where the parallel index gets rebuilt periodically with 
the changing data.


However, this 'periodically' needs to be quite frequent to try to provide 
responsive changes to the index, potentially several times a day.  One problem
is that there can be updates to any of the data in almost any month, so an 
update by a user to 120 documents, one document per month for 10 years, requires 
a full rebuild of the 120 index shards of 7.5m docs each...


I was wondering what the technical reasons were why a 'delete+add' could not 
allow the original docId to be re-used, thus keeping the two parallel indexes in 
sync without requiring a rebuild.


If this could be overcome, this would make this parallel index pattern so much 
more useful for large volume data sets.


Any thoughts
Antony








Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-21 Thread Antony Bowesman
That paper from 1997 is pretty old, but mirrors our experiences in those days. 
Then, we used Solaris processor sets to really improve performance by binding 
one of our processes to a particular CPU while leaving the other CPUs to manage 
the thread intensive work.


You can bind processes/LWPs to a CPU on Solaris with psrset.

The Solaris thread model in the late '90s was also a significant factor in 
performance of multi-threaded programs.  The default thread library in Solaris 8 
implemented a MxN unbound thread model (threads/LWPS).  In those days we found 
that it did not perform well, so used the bound thread model (i.e. 1:1) where a 
Solaris thread was bound permanently to an LWP.  That improved performance a 
lot.  In Solaris 8, Sun had what they called the 'alternate' thread library (T2) 
around 2000, which became the default library in Solaris 9, and implemented a 
1:1 model of Solaris threads to LWPs.  That new library had dramatic performance 
improvements over the old.


Some background info for Java and threading

http://java.sun.com/j2se/1.5.0/docs/guide/vm/thread-priorities.html

Antony


Glen Newton wrote:

I realised that not everyone on this list might be able to access the
IEEE paper I pointed-out, so I will include the abstract and some
paragraphs from the paper which I have included below.

Also of interest (and should be available to all): Fedorova et al.
2005. Performance of Multithreaded Chip Multiprocessors And
Implications For Operating System Design. Usenix 2005.
http://www.eecs.harvard.edu/margo/papers/usenix05/paper.pdf
"Abstract: We investigated how operating system design should be
adapted for multithreaded chip multiprocessors (CMT) – a new
generation of processors that exploit thread-level parallelism to mask
the memory latency in modern workloads. We
determined that the L2 cache is a critical shared resource on CMT and
that an insufficient amount of L2 cache can undermine the ability to
hide memory latency on these processors. To use the L2 cache as
efficiently as possible, we propose an L2-conscious scheduling
algorithm and quantify its performance potential. Using this algorithm
it is possible to reduce miss ratios in the L2 cache by 25-37% and
improve processor throughput by 27-45%."


From Lundberg, L. 1997:
Abstract: "The default scheduling algorithm in Solaris and other
operating systems may result in frequent relocation of threads at
run-time. Excessive thread relocation cause
poor memory performance. This can be avoided by binding threads to
processors. However, binding threads to processors may result in an
unbalanced load. By considering a previously obtained theoretical
result and by evaluating a set of multithreaded Solaris
programs using a multiprocessor with 8 processors, we are able to
bound the maximum performance loss due to binding threads, The
theoretical result is also recapitulated. By evaluating another set of
multithreaded programs, we show that the gain of binding threads to
processors may be substantial, particularly for programs with fine
grained parallelism."

First paragraph: "The thread concept in Solaris [3][5] and other
operating systems makes it possible to write multithreaded programs,
which can be executed in parallel on a multiprocessor. Previous
experience from real world programs [4] show that, using the default
scheduling algorithm in Solaris, threads are frequently relocated from
one processor
to another at run-time. After each such relocation, the code and data
associated with the relocated thread is moved from the cache memory of
the 0113 processor to the cache of the new processor. This reduces the
performance. Run-time relocation of threads to processors can also
result in unpredictable response times. This is a problem in systems
which operate in a real-time environment. In order to avoid poor
memory performance and unpredictable real-time behaviour due to
frequent thread relocation, threads can be bound to processors using
the processor-bind directive [3] [5]. The major problem with binding
threads is that one can end up with an unbalanced load, i.e. some
processors may be extremely busy during some time periods while other
processors are idle."

-Glen

On 21/04/2008, Glen Newton <[EMAIL PROTECTED]> wrote:

And this discussion on bound threads may also shed light on things:
 
http://coding.derkeiler.com/Archive/Java/comp.lang.java.programmer/2007-11/msg02801.html


 -Glen


 On 21/04/2008, Glen Newton <[EMAIL PROTECTED]> wrote:
 > BInding threads to processors - in many situations - improves
 >  throughput by reducing memory overhead. When a thread is running on a
 >  core, its state is local; if it is timeshared-out and either 1)
 >  swapped back in on the same core, it is likely that there will be  the
 >  core's L1 cache; or 2) onto another core, there will not be a cache
 >  hit and the state will have to be fetched from L2 or main memory,
 >  incurring a performance hit, esp in the latter. See Lundberg, L. 1997.
 >  Evalu

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman

Chris Hostetter wrote:
you can't ... that's why i said you'd need to rebuild the smaller index 
completely on a periodic basis (going in the same order as the docs in the


Mmm, the annotations would only be stored in the index.  It would be possible to 
store them elsewhere, so I can investigate that, in which case the rebuild would 
be possible.


i can also imagine a situation where you break both indexes up into lots 
of pieces (shards) and use a MultiReader over lots of ParallelReaders ... 
that way you have much smaller "small" indexes to rebuild when someone 
annotates an email -- and if hte shards are organized by date, you're less 
likely to ever need to rebuild many of them since people will tend to 


Data will be 'sharded' anyway, by date of some granularity.  Looking at the source for MultiReader/MultiSearcher, they are single threaded.  Is there a performance trade-off between single-thread/many small indexes and single-thread/some large indexes?  Can a MultiReader work with one..n readers per thread, something like a thread pool of IndexReaders?  I expect it would be faster to run the searches in parallel.


Disclaimer: all of this is purely brainstorming, i've never actually tried 
anything like this, it may be more trouble then it's worth.


:) Thanks for the sounding board - it's always useful to get new ideas!
Antony






Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman
Thanks all for the suggestions - there was also another thread "Lucene index on 
relational data" which had crossover here.


That's an interesting idea about using ParallelReader for the changeable index.  I had thought to just index a triplet 'owner:mailId:label' in each Doc and have multiple Documents for the same mailId, e.g. if each recipient adds labels for the same mail, or if multiple labels are added by one recipient.  I would then have to make a join using mailId against the core.  However, if I want to use PR, I could have a single Document with multiple fields and, using stored fields, 'modify' that Document.  But what happens to the DocId when the delete+add occurs, and how do I ensure it stays the same?


I'm on 2.3.1.  I seem to recall a discussion on this in another thread, but 
cannot find it.


Antony



Chris Hostetter wrote:

: The archive is read only apart from bulk deletes, but one of the requirements
: is for users to be able to label their own mail.  Given that a Lucene Document
: cannot be updated, I have thought about having a separate Lucene index that
: has just the 3 terms (or some combination of) userId + mailId + label.
: 
: That of course would mean joining searches from the main mail data index and

: the label index.

tangential to the existing follwups about ways to use Filters efficiently 
to get some of the behavior, take a look at ParallelReader ... your use 
case sounds like it might be perfect for it: one really large main dataset 
that changes fairly infrequently, and what changes do occur are mainly 
about adding new records; plus a small "parallel" set of fields about 
each record in the main set which do change fairly frequently.


you build up an index for the main data, and then you periodicly build up 
a second index with the docs in the exact same order as the main index.


additions to the main index don't need to block on rebuilding the secondary
index.  deletes do (since you need to delete from both indexes in parallel 
to keep the ids in sync) ... but that's ok since you said you only need 
occasional bulk deletes (you could process them as an initial step of your 
recuring rebuild of the smaller index).




-Hoss










Re: Using Lucene partly as DB and 'joining' search results.

2008-04-11 Thread Antony Bowesman

Paul Elschot wrote:

Op Friday 11 April 2008 13:49:59 schreef Mathieu Lecarme:



Use Filter and BitSet.
 From the personnal data, you build a Filter
(http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Fil
ter.html) wich is used in the main index.


With 1 billion mails, and possibly a Filter per user, you may want to
use more compact filters than BitSets, which is currently possible
in the development trunk of lucene.


Thanks for the pointers.  I've already used Solr's DocSet interface in my 
implementation, which I think is where the ideas for the current Lucene 
enhancements came from.  They work well to reduce the filter's footprint.  I'm 
also caching filters.


The intention is that there is a user data index and the mail index(es).  The 
search against user data index will return a set of mail Ids, which is the 
common key between the two.  Doc Ids are no good between the indexes, so that 
means a potentially large boolean OR query to create the filter of labelled 
mails in the mail indexes.  I know it's a theoretical question, but will this 
perform?


The read only data and modifiable user data need to be kept separate because the 
RO data can easily be re-created, which means I can't just create the filter as 
part of the base search.


Regards
Antony








Using Lucene partly as DB and 'joining' search results.

2008-04-11 Thread Antony Bowesman
We're planning to archive email over many years and have been looking at using 
DB to store mail meta data and Lucene for the indexed mail data, or just Lucene 
on its own with email data and structure stored as XML and the raw message 
stored in the file system.


For some customers, the volumes are likely to be well over 1 billion mails over
10 years, so some  partitioning of data is needed.  At the moment the thoughts
are moving away from using a DB + Lucene to just Lucene along with a file system
representation of the complete message.  All searches will be against the index 
then the XML mail meta data is loaded from the file system.


The archive is read only apart from bulk deletes, but one of the requirements is 
for users to be able to label their own mail.  Given that a Lucene Document 
cannot be updated, I have thought about having a separate Lucene index that has 
just the 3 terms (or some combination of) userId + mailId + label.
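
For illustration, each label entry in that separate index could be a tiny Document of untokenized fields, so entries can be matched and joined on mailId (a sketch only, with the field names described above):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public final class LabelDoc {
    public static Document create(String userId, String mailId, String label) {
        Document doc = new Document();
        doc.add(new Field("userId", userId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("mailId", mailId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("label", label, Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}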


That of course would mean joining searches from the main mail data index and the 
label index.


Does anyone have any experience of using Lucene this way and is it a realistic 
option of avoiding the DB at all?  I'd rather the headache of scaling just 
Lucene, which is a simple beast, than the whole bundle of 'stuff' that comes 
with the database as well.


Antony







Re: How to improve performance of large numbers of successive searches?

2008-04-10 Thread Antony Bowesman

Chris McGee wrote:


These tips have significantly improved the time to build the directory and 
search it. However, I have noticed that when I perform term queries using 
a searcher many times in rapid succession and iterate over all of the hits 
it can take a significant time. To perform 1000 term query searches each 
with around 2000 hits it takes well over a minute. The time seems to vary 


If you are searching using Hits = searcher.search(), then you should use a HitCollector or the TopDocs method instead.  Iterating over Hits will cause the search to be re-executed every 100 hits.
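
A minimal sketch of the TopDocs route (helper name invented): one search call, then iterate the returned ScoreDocs.

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public final class TopDocsExample {
    public static int countHits(IndexSearcher searcher, Term term, int maxHits) throws IOException {
        TopDocs topDocs = searcher.search(new TermQuery(term), null, maxHits);  // single search
        for (ScoreDoc sd : topDocs.scoreDocs) {
            // process sd.doc / sd.score here
        }
        return topDocs.totalHits;
    }
}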


Antony






Re: Search emails - parsing mailbox (mbox) files

2008-04-04 Thread Antony Bowesman

Subodh Damle wrote:

Is there any reliable implementation for parsing email mailbox files (mbox
format), especially large (>50MB) archives ? Even after searching lucene
mailing list archives, googling around, I couldn't find one. I took a look
at Apache James project which seems to offer some support , but couldn't
find much documentation about it.


Apache James' MIME4J is one parser and Javamail also can parse mail.  I found 
Javamail more intuitive, but have not tested either against a large mail set for 
reliability and performance.


Antony






Re: Biggest index

2008-03-16 Thread Antony Bowesman

[EMAIL PROTECTED] wrote:

Yes of course, the answers to your questions are important too.
But no answer at all until now :(


One example:

1.5 million documents
Approx 15 fields per document
DB is 10-15GB (can't find correct figure)
All on one machine.  No stats on search usage though.

We're about to embark on 25-40M documents (email data) per annum, with no deletes, over 10 years.  We're planning for index distribution, but haven't decided on the partitioning yet.


Antony






Re: Using RangeFilter

2008-01-24 Thread Antony Bowesman

vivek sar wrote:

I've a field as NO_NORM; does it have to be untokenized to be able to sort on it?


NO_NORMS is the same as UNTOKENIZED + omitNorms, so you can sort on that.
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multiple searchers (Was: CachingWrapperFilter: why cache per IndexReader?)

2008-01-23 Thread Antony Bowesman

Toke Eskildsen wrote:


== Average over the first 50.000 queries ==
metis_flash_RAID0_8GB_i37_t2_l21.log - 279.6 q/sec
metis_flash_RAID0_8GB_i37_t2_l23.log - 202.3 q/sec
metis_flash_RAID0_8GB_i37_v23_t2_l23.log - 195.9 q/sec



== Average over the first 340.000 queries ==
metis_flash_RAID0_8GB_i37_t2_l21.log - 305.3 q/sec
metis_flash_RAID0_8GB_i37_t2_l23.log - 260.5 q/sec
metis_flash_RAID0_8GB_i37_v23_t2_l23.log - 294.1 q/sec


These are odd.  In both sets above, the last case (2.3 index with the 2.3 version) 
shows a slowdown compared to the 2.1 index and version, and over the first 50K 
queries it is even slower than 2.3 running against the 2.1 index.  It catches up 
over the longer run.


Any ideas why that might be?  A shared searcher across multiple threads is probably 
quite a common use case.


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



DateTools UTC/GMT mismatch

2008-01-22 Thread Antony Bowesman

Hi,

I just noticed that although the Javadocs for Lucene 2.2 state that the dates 
for DateTools use UTC as a timezone, they are actually using GMT.


Should the Javadocs be corrected, or the code changed to use UTC instead?

Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using RangeFilter

2008-01-21 Thread Antony Bowesman

vivek sar wrote:

I need to be able to sort on optime as well, thus need to store it .


Lucene's default sorting does not need the field to be stored, only indexed as 
untokenized.
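
For example (a sketch; the field and value names are made up):

doc.add(new Field("optime", optimeValue, Field.Store.NO, Field.Index.UN_TOKENIZED));
...
Hits hits = searcher.search(query, new Sort(new SortField("optime", SortField.STRING)));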

Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene sorting case-sensitive by default?

2008-01-15 Thread Antony Bowesman

Erick Erickson wrote:


doc.add(
new Field(
"f",
"This is Some Mixed, case Junk($*%& With Ugly
SYmbols",
Field.Store.YES,
Field.Index.TOKENIZED));





prints "This is Some Mixed, case Junk($*%& With Ugly SYmbols"
yet still finds the document with a search for "junk" using
StandardAnalyzer.


Don't forget you can't sort on that field as it's tokenized; although the original 
is stored and indexed as multiple lower-cased tokens, you will get a 
RuntimeException from FieldCache when you try to sort on it.


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how do I get my own TopDocHitCollector?

2008-01-10 Thread Antony Bowesman

Beard, Brian wrote:

Ok, I've been thinking about this some more. Is the cache mechanism
pulling from the cache if the external id already exists there and then
hitting the searcher if it's not already in the cache (maybe using a
FieldSelector for just retrieving the external id)?


I am warming searchers in the background and each searcher has one or more 
query-related caches.  The external Id cache is normally preloaded by simply iterating 
terms, e.g.


String field = fieldName.intern();
final String[] retArray = new String[reader.maxDoc()];
TermDocs termDocs = reader.termDocs();
TermEnum termEnum = reader.terms(new Term(field, ""));
try
{
    do
    {
        Term term = termEnum.term();
        if (term == null || term.field() != field)
            break;
        String termval = term.text();
        termDocs.seek(termEnum);
        while (termDocs.next())
        {
            retArray[termDocs.doc()] = termval;
        }
    }
    while (termEnum.next());
}
finally
{
    termDocs.close();
    termEnum.close();
}
return retArray;

I do allow for a partial cache, in which case, as you suggest, the searcher uses 
a FieldSelector to get the external Id from the document, which is then stored in 
the cache.


Antony





-Original Message-
From: Beard, Brian [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 10, 2008 10:08 AM

To: java-user@lucene.apache.org
Subject: RE: how do I get my own TopDocHitCollector?

Thanks for the post. So you're using the doc id as the key into the
cache to retrieve the external id. Then what mechanism fetches the
external id's from the searcher and places them in the cache?


-----Original Message-
From: Antony Bowesman [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 09, 2008 7:19 PM

To: java-user@lucene.apache.org
Subject: Re: how do I get my own TopDocHitCollector?

Beard, Brian wrote:

Question:

The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.

I'd like to use something like TopDocs to return the first 1024

results

that have unique record_id's, but I will want to skip some of the
returned documents that have the same record_id. We're using the
ParallelMultiSearcher. 


I read that I could use a HitCollector and throw an exception to get

it

to stop, but is there a cleaner way?


I'm doing a similar thing.  I have external Ids (equivalent to your record_id), 
which have one or more Lucene Documents associated with them.  I wrote a custom 
HitCollector that uses a Map to hold the external ids collected so far, along 
with the collected document.


I had to write my own priority queue to know when an object was dropped off the 
bottom of the score-sorted queue, but the latest PriorityQueue on the trunk now 
has insertWithOverflow(), which does the same thing.


Note that ResultDoc extends ScoreDoc, so that the external Id of the
item 
dropped off the queue can be used to remove it from my Map.


Code snippet is somewhat as below (I am caching my external Ids, hence
the cache 
usage)


protected Map results;

public void collect(int doc, float score)
{
    if (score > 0.0f)
    {
        totalHits++;
        if (pq.size() < numHits || score > minScore)
        {
            OfficeId id = cache.get(doc);
            ResultDoc rd = results.get(id);
            //  No current result for this ID yet found
            if (rd == null)
            {
                rd = new ResultDoc(id, doc, score);
                ResultDoc added = pq.insert(rd);
                if (added == null)
                {
                    //  Nothing dropped off the bottom
                    results.put(id, rd);
                }
                else
                {
                    //  Return value dropped off the bottom
                    results.remove(added.id);
                    results.put(id, rd);
                    remaining++;
                }
            }
            //  Already found this ID, so replace high score if necessary
            else
            {
                if (score > rd.score)
                {
                    pq.remove(rd);
                    rd.score = score;
                    pq.insert(rd);
                }
            }
            //  Readjust our minimum score again from the top entry
            minScore = pq.peek().score;
        }
        else
            remaining++;
    }
}

HTH
Antony




Re: how do I get my own TopDocHitCollector?

2008-01-09 Thread Antony Bowesman

Beard, Brian wrote:

Question:

The documents that I index have two id's - a unique document id and a
record_id that can link multiple documents together that belong to a
common record.

I'd like to use something like TopDocs to return the first 1024 results
that have unique record_id's, but I will want to skip some of the
returned documents that have the same record_id. We're using the
ParallelMultiSearcher. 


I read that I could use a HitCollector and throw an exception to get it
to stop, but is there a cleaner way?


I'm doing a similar thing.  I have external Ids (equivalent to your record_id), 
which have one or more Lucene Documents associated with them.  I wrote a custom 
HitCollector that uses a Map to hold the external ids collected so far, along 
with the collected document.


I had to write my own priority queue to know when an object was dropped off the 
bottom of the score-sorted queue, but the latest PriorityQueue on the trunk now 
has insertWithOverflow(), which does the same thing.


Note that ResultDoc extends ScoreDoc, so that the external Id of the item 
dropped off the queue can be used to remove it from my Map.


Code snippet is somewhat as below (I am caching my external Ids, hence the cache 
usage)


   protected Map results;

   public void collect(int doc, float score)
   {
       if (score > 0.0f)
       {
           totalHits++;
           if (pq.size() < numHits || score > minScore)
           {
               OfficeId id = cache.get(doc);
               ResultDoc rd = results.get(id);
               //  No current result for this ID yet found
               if (rd == null)
               {
                   rd = new ResultDoc(id, doc, score);
                   ResultDoc added = pq.insert(rd);
                   if (added == null)
                   {
                       //  Nothing dropped off the bottom
                       results.put(id, rd);
                   }
                   else
                   {
                       //  Return value dropped off the bottom
                       results.remove(added.id);
                       results.put(id, rd);
                       remaining++;
                   }
               }
               //  Already found this ID, so replace high score if necessary
               else
               {
                   if (score > rd.score)
                   {
                       pq.remove(rd);
                       rd.score = score;
                       pq.insert(rd);
                   }
               }
               //  Readjust our minimum score again from the top entry
               minScore = pq.peek().score;
           }
           else
               remaining++;
       }
   }

HTH
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Antony Bowesman

Ariel wrote:


The problem I have is that my application spends a lot of time to index all
the documents, the delay to index 10 gb of pdf documents is about 2 days (to
convert pdf to text I am using pdfbox) that is of course a lot of time,
others applications based in lucene, for instance ibm omnifind only takes 5
hours to index the same amount of pdfs documents. I would like to find out


If you are using log4j, make sure you have the pdfbox log4j categories set to 
info or higher, otherwise this really slows it down (by a factor of 10); 
alternatively, make sure you are using the non-log4j version.  See 
http://sourceforge.net/forum/message.php?msg_id=3947448


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Deleting a single TermPosition for a Document

2008-01-08 Thread Antony Bowesman

Otis Gospodnetic wrote:

Is your user field stored?  If so, you cold find the target Document, get the
user field value, modify it, and re-add it to the Document (or something
close to this -- I am doing this with one of the indices on simpy.com and
it's working well).


No, it's not stored.  I'm not sure I understand how you 'modify it', as it's not 
possible to modify an existing Document.  Or do you mean you fetch all the stored 
fields from the existing Document, delete it, and then add it back with the 
modified field?


I have ~20 fields per Document and most are not stored.
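
If that is the pattern, I imagine it looks roughly like this (a sketch that only 
works if every field needed to rebuild the Document is stored; "docId" and the 
other names are invented):

IndexSearcher searcher = new IndexSearcher(dir);
Hits found = searcher.search(new TermQuery(new Term("docId", docId)));
if (found.length() == 1)
{
    Document updated = new Document();
    // ...copy each stored field of found.doc(0) into 'updated', swapping in
    // the modified 'user' value...
    updated.add(new Field("user", newUser, Field.Store.YES, Field.Index.UN_TOKENIZED));

    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    writer.updateDocument(new Term("docId", docId), updated);   // delete + re-add
    writer.close();
}
searcher.close();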

Antony




Otis

-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message  From: Antony Bowesman <[EMAIL PROTECTED]> To:
java-user@lucene.apache.org Sent: Tuesday, January 8, 2008 12:47:05 AM 
Subject: Deleting a single TermPosition  for a

Document

I'd like to 'update' a single Document in a Lucene index.  In practice, this
 'update' is actually just a removal of a single TermPosition for a given
Term for a given doc Id.

I don't think this is currently possible, but would it be easy to change
Lucene to support this type of usage?

The reason for this is to optimise my index usage.  I'm using Lucene to index
 arbitrary data sets, however, in some data sets, each Document is indexed
once for each user who has an interest in the document.  For example, with 
mail data, a mail item (with a single recipient) is stored as two Documents,

once with the 'user' field set to the sender's user Id and again with the
user field set to the recipents's user Id.  Searches just filter mail for a
given user by the user field.

When one of those users deletes the mail, the Document with the 'user' field
is simply deleted.  One of the original reasons for doing this was to enable
 horizontal partitioning of the index.  This works nicely, but of course the
 index is bigger than necessary and the number of terms positions is at least
 double what is necessary.

I had thought to originally indexed the data once, with the user field set to
 the sender and recipient user Id, but when the sender or recipient deletes
the mail from their mailbox, searching becomes more complicated as the index
does not reflect the external database state unless the mail is reindexed.

Is this something other's have wanted or are there other solutions to this
problem?

Thanks Antony




- To
unsubscribe, e-mail: [EMAIL PROTECTED] For additional
commands, e-mail: [EMAIL PROTECTED]





- To
unsubscribe, e-mail: [EMAIL PROTECTED] For additional
commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Deleting a single TermPosition for a Document

2008-01-07 Thread Antony Bowesman
I'd like to 'update' a single Document in a Lucene index.  In practice, this 
'update' is actually just a removal of a single TermPosition for a given Term 
for a given doc Id.


I don't think this is currently possible, but would it be easy to change Lucene 
to support this type of usage?


The reason for this is to optimise my index usage.  I'm using Lucene to index 
arbitrary data sets; however, in some data sets, each Document is indexed once 
for each user who has an interest in the document.  For example, with mail data, 
a mail item (with a single recipient) is stored as two Documents, once with the 
'user' field set to the sender's user Id and again with the user field set to 
the recipient's user Id.  Searches just filter mail for a given user by the user 
field.


When one of those users deletes the mail, the Document with that 'user' field is 
simply deleted.  One of the original reasons for doing this was to enable 
horizontal partitioning of the index.  This works nicely, but of course the 
index is bigger than necessary and the number of term positions is at least 
double what is needed.


I had originally thought to index the data once, with the user field set to 
both the sender's and the recipient's user Ids, but when the sender or recipient 
deletes the mail from their mailbox, searching becomes more complicated, as the 
index does not reflect the external database state unless the mail is reindexed.


Is this something others have wanted, or are there other solutions to this 
problem?

Thanks
Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
Looks like I got myself into a twist for nothing - the reader will see a 
consistent view, regardless of what the writer does, as long as the reader remains open.


Apologies for the noise...
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman

Using Lucene 2.1

Antony Bowesman wrote:
My application batch adds documents to the index using 
IndexWriter.addDocument.  Another thread handles searchers, creating new 
ones as needed, based on a policy.  These searchers open a new 
IndexReader and there is currently no synchronisation between this 
action and any being performed by my writer threads.


I wanted to use the new writer.deleteDocuments and writer.updateDocument 
in the same phase as the addDocument, so I wrote some test cases to 
check the behaviour of using these in the same phase and found that an 
IndexReader opened on the index during this phase gives some odd values 
and this has upset my understanding of the concurrency issue...


For example, the following


Create IndexWriter
Loop IndexWriter.addDocument * count

Create IndexReader
Check numDocs
Check maxDoc
Close reader

Loop IndexWriter.deleteDocuments * count

Create IndexReader
Check numDocs
Check maxDoc
Close reader

Close IndexWriter


However, the numDocs shows interesting numbers with different values of 
count.


count = 2, numDocs = 0 after add, 0 after delete
count = 100, numDocs = 100 after add and 100 after delete
count = 127, numDocs = 120 after add, 120 after delete
count = 150, numDocs = 150 after add and 150 after delete
count = 1000, numDocs = 1000 after add and 0 after delete

I then checked how terms returned via a TermEnum were affected and these 
too also do not reflect the current state of a deleted document.


I know these numbers are affected by the 
DEFAULT_MAX_BUFFERED_DELETE_TERMS and DEFAULT_MERGE_FACTOR.


My question is therefore: how can actions by a reader determine the real 
state of a Document or Term if a writer is currently updating the 
index.  Using reader.isDeleted(docNum) shows an item is not deleted even 
though it has been, but not flushed.  reader.hasDeletions() also shows 
false.


My index app never actually uses reader.document() as it collects and 
caches Id terms using TermEnum when opening a reader and stale Ids are 
handled elsewhere, however, as it stands, the following logic


if (!reader.isDeleted(n))
doc = reader.document(n)

can fail with an IllegalArgumentException if the concurrent writer 
flushes in between the test and read.


Thanks
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
My application batch adds documents to the index using IndexWriter.addDocument. 
 Another thread handles searchers, creating new ones as needed, based on a 
policy.  These searchers open a new IndexReader and there is currently no 
synchronisation between this action and any being performed by my writer threads.


I wanted to use the new writer.deleteDocuments and writer.updateDocument in the 
same phase as the addDocument, so I wrote some test cases to check the behaviour 
of using these in the same phase and found that an IndexReader opened on the 
index during this phase gives some odd values and this has upset my 
understanding of the concurrency issue...


For example, the following


Create IndexWriter
Loop IndexWriter.addDocument * count

Create IndexReader
Check numDocs
Check maxDoc
Close reader

Loop IndexWriter.deleteDocuments * count

Create IndexReader
Check numDocs
Check maxDoc
Close reader

Close IndexWriter


However, the numDocs shows interesting numbers with different values of count.

count = 2, numDocs = 0 after add, 0 after delete
count = 100, numDocs = 100 after add and 100 after delete
count = 127, numDocs = 120 after add, 120 after delete
count = 150, numDocs = 150 after add and 150 after delete
count = 1000, numDocs = 1000 after add and 0 after delete

I then checked how terms returned via a TermEnum were affected, and these too 
do not reflect the current state of a deleted document.


I know these numbers are affected by the DEFAULT_MAX_BUFFERED_DELETE_TERMS and 
DEFAULT_MERGE_FACTOR.


My question is therefore: how can actions by a reader determine the real state 
of a Document or Term if a writer is currently updating the index?  Using 
reader.isDeleted(docNum) shows an item as not deleted even though it has been 
deleted but not yet flushed.  reader.hasDeletions() also shows false.


My index app never actually uses reader.document(), as it collects and caches Id 
terms using TermEnum when opening a reader, and stale Ids are handled elsewhere.  
However, as it stands, the following logic


if (!reader.isDeleted(n))
    doc = reader.document(n);

can fail with an IllegalArgumentException if the concurrent writer flushes in 
between the test and read.


Thanks
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: deleteDocuments by Term[] for ALL terms

2007-12-04 Thread Antony Bowesman

Thanks Mike, just what I was after.
Antony


Michael McCandless wrote:

You can just create a query with your and'd terms, and then do this:

  Weight weight = query.weight(indexSearcher);
  IndexReader reader = indexSearcher.getIndexReader();
  Scorer scorer = weight.scorer(reader);
  int delCount = 0;
  while(scorer.next()) {
reader.deleteDocument(scorer.doc());
delCount++;
  }

that iterates over all the docIDs without scoring them and without
building up a Hit for each, etc.

Mike

"Antony Bowesman" <[EMAIL PROTECTED]> wrote:

Hi,

I'm using IndexReader.deleteDocuments(Term) to delete documents in
batches.  I 
need the deleted count, so I cannot use IndexWriter.deleteDocuments().


What I want to do is delete documents based on more than one term, but
not like 
IndexWriter.deleteDocuments(Term[]) which deletes all documents with ANY
term. 
I want it to delete documents which have ALL terms, e.g.


Term("owner", "ownerUID") AND Term("subject", "something");

I have a new reader being used, so I could make a new IndexSearcher and
query 
the documents with all terms in a BooleanQuery and then iterate the
results, but 
that would either mean using the Hits mechanism or TopDocs and seems like
a 
heavyweight way to do things.


What I want to be able to do is to delete sequentially without storing up
a 
result set as I may want to delete all and ALL may be rather big.


I see the implementation uses reader.termDocs() to do the deletion for a
single 
Term, which of course is easy, but is there a simple way to make a
deletion for 
multiple terms with AND via the reader using, say termDocs, that will not 
potentially use large amounts of memory, or should I just go with the
searcher 
TopDocs mechanism and do that also in batches to avoid the risk of a
large 
memory hit.


I know there's lots of clever 'expert-mode' stuff under the Lucene API
hood, but 
does anyone know any good way to do this or have I missed anything
obvious in 
the API docs?


Thanks
Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



deleteDocuments by Term[] for ALL terms

2007-11-25 Thread Antony Bowesman

Hi,

I'm using IndexReader.deleteDocuments(Term) to delete documents in batches.  I 
need the deleted count, so I cannot use IndexWriter.deleteDocuments().


What I want to do is delete documents based on more than one term, but not like 
IndexWriter.deleteDocuments(Term[]) which deletes all documents with ANY term. 
I want it to delete documents which have ALL terms, e.g.


Term("owner", "ownerUID") AND Term("subject", "something");

I have a new reader being used, so I could make a new IndexSearcher and query 
the documents with all terms in a BooleanQuery and then iterate the results, but 
that would either mean using the Hits mechanism or TopDocs and seems like a 
heavyweight way to do things.


What I want to be able to do is delete sequentially without building up a 
result set, as I may want to delete everything, and 'everything' may be rather big.


I see the implementation uses reader.termDocs() to do the deletion for a single 
Term, which of course is easy.  Is there a simple way, via the reader (using, say, 
termDocs), to delete by multiple terms ANDed together without potentially using 
large amounts of memory?  Or should I just go with the searcher TopDocs mechanism 
and do that in batches too, to avoid the risk of a large memory hit?


I know there's lots of clever 'expert-mode' stuff under the Lucene API hood, but 
does anyone know any good way to do this or have I missed anything obvious in 
the API docs?


Thanks
Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: efficient way to filter out unwanted results

2007-06-15 Thread Antony Bowesman

yu wrote:

Thanks Sawan for the suggestion.
I guess this will work for statically known doc ids. In my case, I know 
only external ids that I want to exclude from the result set.for each  
search.  Of course, I can always exclude these  docs in a post search 
process. I am curious if there are other more efficient approach.


When you open a searcher, you could create a cached array of all your external 
Ids with their Lucene DocId.  Using a custom HitCollector, which can be created 
with the Ids you wish to exclude, you can get a document's external Id during 
the collect() method using the docid.  Then just check the external Id of the 
matched document against the exclusion list.
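
The rough shape of such a collector (just a sketch; the id array would typically be 
built per reader with something like FieldCache.DEFAULT.getStrings(reader, "extId"), 
and the names here are invented):

public class ExcludingCollector extends HitCollector
{
    private final String[] externalIds;   // cached per reader, indexed by doc id
    private final Set excluded;           // external ids to exclude
    private final HitCollector delegate;  // whatever collector you already use

    public ExcludingCollector(String[] externalIds, Set excluded, HitCollector delegate)
    {
        this.externalIds = externalIds;
        this.excluded = excluded;
        this.delegate = delegate;
    }

    public void collect(int doc, float score)
    {
        if (!excluded.contains(externalIds[doc]))
            delegate.collect(doc, score);   // keep only non-excluded documents
    }
}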


As long as you have your searcher open, the cache will remain valid.
Antony






Thanks again for your help.

Jay

Sawan Sharma wrote:

Hello Jay,

I am not sure up to what level I understood your problem . But as far 
as my
assumption, you can try HitCollector class and its collect method. 
Here you

can get DocID for each hit and can remove while searching.

Hope it will be useful.

Sawan
(Chambal.com inc. NJ USA)





On 6/15/07, yu <[EMAIL PROTECTED]> wrote:


Hi everyone,

I am trying to remove several docs from search results each time I do
query. The docs can be identified by an exteranl ids whcih are
saved/indexed. I could use a Query or QueryFilter  to achieve this but
not sure if it's the most efficient way to do that.
Anyone has any experience or idea?
Thanks!

Jay

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: "Contains" query parsed to PrefixQuery

2007-06-11 Thread Antony Bowesman

It's a bug in 2.1, fixed by Doron Cohen

http://issues.apache.org/jira/browse/LUCENE-813

Antony


dontspamterry wrote:

Hi all,

I was experimenting with queries using wildcard on an untokenized field and
noticed that a query with both a starting and trailing wildcard, e.g. *abc*,
gets parsed to the PrefixQuery *abc. I did enable the leading wildcard in
the QueryParser to allow the query above to be parsed so I'm wondering is
there any way to get the query parser to parse *abc* as a WildcardQuery and
not a PrefixQuery, short of overriding the applicable query parser methods?
I don't think the parser treats every query with a trailing wildcard as a
PrefixQuery because when I tried, a*b*, it got parsed to a WildcardQuery.
Thanks for any insight!

-Terry



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can I search over all documents NOT in a certain subset?

2007-06-08 Thread Antony Bowesman

Hilton Campbell wrote:

Yes, that's actually come up.  The document ids are indeed changing which is
causing problems.  I'm still trying to work it out myself, but any help
would most definitely be appreciated.


If you have an application Id per document, you could cache that field for each 
reader.  When you open the new reader, create a new cache of the Ids for that 
reader and then re-evaluate the bitmap according to the changed doc ids.


You may be able to optimise the two-reader case by calculating the mapping once 
and then using it for each bitmap that needs re-evaluating.
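
Something along these lines (a sketch; "appId", oldBits and the reader variables are 
invented names):

String[] oldIds = FieldCache.DEFAULT.getStrings(oldReader, "appId");
String[] newIds = FieldCache.DEFAULT.getStrings(newReader, "appId");

// Remember which application Ids were marked in the old reader's bitmap
Set seen = new HashSet();
for (int doc = 0; doc < oldIds.length; doc++)
{
    if (oldBits.get(doc))
        seen.add(oldIds[doc]);
}

// Rebuild the bitmap against the new reader's doc ids
BitSet newBits = new BitSet(newIds.length);
for (int doc = 0; doc < newIds.length; doc++)
{
    if (seen.contains(newIds[doc]))
        newBits.set(doc);
}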


Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can I search over all documents NOT in a certain subset?

2007-06-06 Thread Antony Bowesman

Steven Rowe wrote:

Conceptually (caveat: untested), you could:

1. Extend Filter[1] (call it DejaVuFilter) to hold a BitSet per
IndexReader.  The BitSet would hold one bit per doc[2], each initialized
to true.

2. Unset a DejaVuFilter instance's bit for each of your top N docs by
walking the TopDocs returned by Searcher.search(Query,Filter,int)[3].
Initially, you could pass in null for the Filter, and then for all
following calls, an instance of DejaVuFilter.


Just a thought...

If Hilton wants to be aware of new Documents in the index since the previous 
search, this requires opening a new IndexReader.


If Documents have only been added to the index, I expect (but am not sure) that 
the bits from the old IndexReader are still valid for the document numbers in 
the new reader.  However, if there have been deletions, or optimisation has 
occurred between reader instances, then the document ids from the old reader may 
not represent the same documents in the new reader, so the Filter built for the 
old reader will not be valid for a search against the new reader and you may 
get false matches.


I don't think there will be a problem if there are no deletions.

Antony






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does Lucene search over memory too?

2007-05-29 Thread Antony Bowesman

Doron Cohen wrote:

Antony Bowesman <[EMAIL PROTECTED]> wrote on 28/05/2007 22:48:41:


I read the new IndexWriter Javadoc and I'm unclear about this
autocommit.  In
2.1, I thought an IndexReader opened in an IndexSearcher does not "see"
additions to an index made by an IndexWriter, i.e. maxDoc and
numDocs remain
constant, so the statement

"When autoCommit is true then every flush is also a commit (IndexReader
instances will see each flush as changes to the index). This is
the default, to
match the behavior before 2.2"

makes me wonder if my assumptions are wrong.  Can you clarify
what it means by
the IndexReader "seeing" changes to the index?


Antony, your assumptions were (and still are) correct - once
an index reader is opened it is unaffected by changes to the
underlying index. Would it be clearer if the javadoc said:
"(An IndexReader instance will see changes to the index caused
by flush operations that completed prior to opening that
IndexReader instance)."?


Thanks Doron, I understood "IndexReader instance" as one that was already open, 
as it's not possible to 'open' an existing object instance, the static open() 
methods create new instances.


Your explanation clarifies things.  So, the scenario where the IndexReader sees 
changes with autocommit=true is


BEGIN
IndexWriter - open
- add documents
  - some merges/flushes occur here
  IndexReader - open
  - sees index with results of earlier merges
IndexWriter - close
END

whereas with autocommit=false, the IndexReader would see the state of the index 
at BEGIN and any merges/flushes would not show.


Perhaps the text could then be something like

"(Changes to the index caused by flush operations will be visible to an 
IndexReader when it is opened prior to closing the IndexWriter)"


It would also be worth updating the isCurrent() Javadoc to clarify its 
behaviour.  Presumably, with autocommit=false, isCurrent() will always return 
true if one is only using the IndexWriter to add/delete documents.


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does Lucene search over memory too?

2007-05-28 Thread Antony Bowesman

Michael McCandless wrote:

The "autoCommit" mode for IndexWriter has not actually been released
yet: you can only use it on the trunk.

It actually serves a different purpose: it allows you to make sure
your searchers do not see any changes made by the writer (even the
ones that have been flushed) until you call close.  It defaults to
autoCommit="true", which matches the current released Lucene
behavior.


I read the new IndexWriter Javadoc and I'm unclear about this autocommit.  In 
2.1, I thought an IndexReader opened in an IndexSearcher does not "see" 
additions to an index made by an IndexWriter, i.e. maxDoc and numDocs remain 
constant, so the statement


"When autoCommit is true then every flush is also a commit (IndexReader 
instances will see each flush as changes to the index). This is the default, to 
match the behavior before 2.2"


makes me wonder if my assumptions are wrong.  Can you clarify what it means by 
the IndexReader "seeing" changes to the index?


Thanks
Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: maxDoc and arrays

2007-05-24 Thread Antony Bowesman

Carlos Pita wrote:

Hi all,

Is there any guaranty that the maxDoc returned by a reader will be about 
the

total number of indexed documents?


What struck me in this thread was that there may be a misunderstanding of the 
relationship between numDocs/maxDoc and an IndexReader.


When an IndexReader is opened its maxDoc and numDocs will never change 
regardless of the additions or deletions to the index.  At least I've not been 
able to make them change in my test cases.


So, when a new document is added after a reader has been opened, that document 
is not yet visible via the original reader.  If you are caching that array, you 
would not update it, as it relates to the reader opened on the index at that 
point in time.


When you open a new reader, its numDocs and maxDoc will reflect that addition; 
the same applies to deletions.  After opening the new reader, you would need to 
regenerate your array cache.


As Hoss has said, this is pretty much what FieldCache does and it holds the 
caches keyed by the IndexReader.


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Memory leak (JVM 1.6 only)

2007-05-16 Thread Antony Bowesman

Daniel Noll wrote:

On Tuesday 15 May 2007 21:59:31 Narednra Singh Panwar wrote:

try using -Xmx option with your Application. and specify maximum/ minimum
memory for your Application.


It's funny how a lot of people instantly suggest this.  What if it isn't 
possible?  There was a situation a while back where I said I had allocated it 
1.8GB, and someone *still* recommended this option. :-)


Yes, and if it is a memory leak, it only prolongs the terminal illness :)



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Turning PrefixQuery into a TermQuery

2007-04-11 Thread Antony Bowesman

Steffen Heinrich wrote:
Normally an IndexWriter uses only one default Analyzer for all its 
tokenizing businesses. And while it is appearantly possible to supply 
a certain other instance when adding a specific document there seems 
to be no way to use different analyzers on different fields within 
one document.


Use the PerFieldAnalyzerWrapper.

http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

It allows different analyzers to be used for different fields.
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Not able to search on UN_TOKENIZED fields

2007-04-09 Thread Antony Bowesman
You can use KeywordAnalyzer with your QueryParser, which will correctly 
handle UN_TOKENIZED fields, but it will then be used for all fields.


To use a field-specific Analyzer, either use PerFieldAnalyzerWrapper, preloaded 
with all the relevant fields, as the Analyzer for QueryParser, or override 
QueryParser's getFieldQuery() and choose your Analyzer there based on the field 
being searched.
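
The PerFieldAnalyzerWrapper route looks roughly like this (a sketch; the field 
names are made up):

PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("symbol", new KeywordAnalyzer());     // UN_TOKENIZED fields
analyzer.addAnalyzer("category", new KeywordAnalyzer());

QueryParser qp = new QueryParser("body", analyzer);
Query query = qp.parse("symbol:ABC12 AND some body text");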


Antony

Ryan O'Hara wrote:

Hey Erick,

Thanks for the quick response.  I need a truly exact match.  What I 
ended up doing was using a TOKENIZED field, but altering the 
StandardAnalyzer's stop word list to include only the word/letter 'a'.  
Below is my searching code:


String[] stopWords = {"a"};
StandardAnalyzer sa = new StandardAnalyzer(stopWords);
QueryParser qp = new QueryParser("symbol", sa);
Query queryPhrase = qp.parse(symbol.toUpperCase());
Hits hits = searcher.search(queryPhrase);
String hit;
if(hits.length() > 0){
hit = hits.doc(0).get("count");
count = Integer.parseInt(hit);
}

Is the reason it wasn't working due to the fact that I'm passing in a 
StandardAnalyzer?  I thought that maybe the searching mechanisms would 
be able to use or not use an analyzer according to what the field.index 
value is.


One other question that you may have an answer to:  I'm eventually going 
to need to alter the stop word list to include all default stop words, 
except those that match certain criteria.  Can this be done?


Thanks,
Ryan




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Range search in numeric fields

2007-04-03 Thread Antony Bowesman

Ivan Vasilev wrote:

Hi All,
I have the following problem:
I have to implement range search for fields that contain numbers. For 
example the field size that contains file size. The problem is that the 
numbers are not kept in strings with strikt length. There are field 
values like this: "32", "421", "1201". So when makeing search like this: 
+size:[10 TO 50], as the order for string is lexicorafical the result 
contains the documents with size 32 and 1201. I can see the following 
possible aproaches:
1. Changing indexing process so that all data entered in those fields is 
with fixed length. Example 032, 421, 0001201.

Disadvantages here are:
   - Have to be reindexed all existng indexes;
   - The index will grow a bit.


Look at Solr's NumberUtils.int2sortableStr().  It will mean reindexing, but the 
number then has a fixed storage size (6 bytes for ints).  It converts numbers to a 
3-char Unicode representation that is sortable and therefore range-searchable.
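
Something like this at index and query time (a sketch, assuming Solr's 
org.apache.solr.util.NumberUtils is on the classpath):

doc.add(new Field("size", NumberUtils.int2sortableStr(fileSize),
                  Field.Store.NO, Field.Index.UN_TOKENIZED));

Query q = new RangeQuery(new Term("size", NumberUtils.int2sortableStr(10)),
                         new Term("size", NumberUtils.int2sortableStr(50)), true);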


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Benchmarking LUCENE-584 with contrib/benchmark

2007-04-02 Thread Antony Bowesman

Otis Gospodnetic wrote:

Here is one more related question.
It looks like the o.a.l.benchmark.Driver class is supposed to be a generic 
driver class that uses the Benchmarker configured in one of those conf/*.xml 
files.  However, I see StandardBenchmarker.class hard-coded there:

digester.addObjectCreate("benchmark/benchmarker", "class", 
StandardBenchmarker.class);  <==


Maybe I'm missing something, but isn't the 3rd param to addObjectCreate just a 
default, with the real class defined by the "class" attribute in the XML file?


e.g.



Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help - FileNotFoundException during IndexWriter.init()

2007-04-01 Thread Antony Bowesman

Michael McCandless wrote:

Yes, I've disabled it currently while the new test runs.  Let's see. 
I'll re-run the test a few more times and see if I can re-create the problem.


OK let's see if that makes it go away!  Hopefully :)


I ran the tests several times over the weekend with no virus checker in the DB 
directory and haven't managed to reproduce the problem.


Thanks for the help Mike.  Nothing like an exception never seen before, two days 
before the product is due to go live, to induce mild panic ;)


Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help - FileNotFoundException during IndexWriter.init()

2007-03-31 Thread Antony Bowesman

Michael McCandless wrote:


Hmmm.  It seems like what's happening is the file in fact exists but
Lucene gets "Access is denied" when trying to read it.  Lucene takes a
listing of the directory, first.  So if it Lucene has permission to
take a directory listing but then no permission to open the segments_N
file for reading, that would cause this exception.

Is it possible that your virus checker thought the segments_gq9 file
had a virus and turned off "read permisson" in the ACLs for it?  Can
you check to see which specific file(s) it thought had viruses?


What's in the segments file?  The virus would have existed in one of the 
attachments in another directory queued for indexing.  Those files are queued by 
a process that has no involvement with the indexing and has not caused a problem 
on another machine with McAfee on it.  This one has F-Secure.


Unfortunately I was too quick to cancel the message, as I assumed it was not a 
problem until I looked at the test logs.



Is it possible to tell the virus checker NOT to not check files in
your Lucene index directory?


Yes, I've disabled it currently while the new test runs.  Let's see.  I'll 
re-run the test a few more times and see if I can re-create the problem.


Thanks for the rapid response Mike
Antony





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Help - FileNotFoundException during IndexWriter.init()

2007-03-31 Thread Antony Bowesman
I got the following exception this morning when running one last test on a data 
set that has been indexed many times before over the past few months.


java.io.FileNotFoundException: 
D:\72ed1\server\Java\Search\0008\index\0001\segments_gq9 (Access is denied)

    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:497)
    at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:522)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:434)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:180)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:235)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:579)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:232)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:385)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:276)

On Windows with Lucene 2.1.

The test was run with only a single thread handling an incoming queue of 155K 
requests with ~200K documents so there is no concurrency issue.  There is no 
optimisation going on.  I am warming a new searcher every minute (for load) 
using a new reader.


I know that one of the files being indexed has a virus, and when I came back to 
the machine the virus scanner had popped up at some point, so my suspicion is 
that it is the cause.  I am running the test again, but can any of the gurus 
give any ideas on what can cause this?


It did have to happen the day after my deadline :(

Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Antony Bowesman

Peter Keegan wrote:

I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point since score 
may be

more important. How did you implement remove()?


I've got my own PriorityQueue

public boolean remove(E o)
{
if (o == null)
return false;

for (int i = 1; i <= size; i++)
{
if (queue[i] == o)
{
removeElement(i);
return true;
}
}
return false;
}

I've got a reference to the original object, so I'm using == to locate it.  I've 
not used equals() as I've not yet worked out whether that will cause me any 
problems with hashing.


Antony



Peter


On 3/29/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:


I've got a similar duplicate case, but my duplicates are based on an
external ID
rather than Doc id so occurs for a single Query.  It's using a custom
HitCollector but score based, not field sorted.

If my duplicate contains a higher score than one on the PQ I need to
update the
stored score with the higher one, so PQ needs a replace() method where 
the

stored object.equals() can be used to find the object to delete.  I'm not
sure
if there's a way to find the object efficiently in this case other than a
linear
search.  I implemented remove().

Peter, how did you achieve 'last wins' as you must presumably remove 
first

from
the PQ?

Antony


Peter Keegan wrote:
> The duplicate check would just be on the doc ID. I'm using TreeSet to
> detect
> duplicates with no noticeable affect on performance. The PQ only has to
be
> checked for a previous value IFF the element about to be inserted is
> actually inserted and not dropped because it's less than the least 
value

> already in there. So, the TreeSet is never bigger than the size of the
PQ
> (typically 25 to a few hundred items), not the size of all hits.
>
> Peter
>
> On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>
>> Hm, removing duplicates (as determined by a value of a specified
document
>> field) from the results would be nice.
>> How would your addition affect performance, considering it has to 
check

>> the PQ for a previous value for every candidate hit?
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>>
>> - Original Message 
>> From: Peter Keegan <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, March 29, 2007 9:39:13 AM
>> Subject: FieldSortedHitQueue enhancement
>>
>> This is request for an enhancement to 
FieldSortedHitQueue/PriorityQueue

>> that
>> would prevent duplicate documents from being inserted, or
alternatively,
>> allow the application to prevent this (reason explained below). I can
do
>> this today by making the 'lessThan' method public and checking the
queue
>> before inserting like this:
>>
>> if (hq.size() < maxSize) {
>>// doc will be inserted into queue - check for duplicate before
>> inserting
>> } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc,
>> (ScoreDoc)hq.top()) {
>>   // doc will be inserted into queue - check for duplicate before
>> inserting
>> } else {
>>   // doc will not be inserted - no check needed
>> }
>>
>> However, this is just replicating existing code in
>> PriorityQueue->insert().
>> An alternative would be to have a method like:
>>
>> public boolean wouldBeInserted(ScoreDoc doc)
>> // returns true if doc would be inserted, without inserting
>>
>> The reason for this is that I have some queries that get expanded into
>> multiple searches and the resulting hits are OR'd together. The 
queries

>> contain 'terms' that are not seen by Lucene but are handled by a
>> HitCollector that uses external data for each document to evaluate
hits.
>> The
>> results from the priority queue should contain no duplicate documents
>> (first
>> or last doc wins).
>>
>> Do any of these suggestions seem reasonable?. So far, I've been 
able to

>> use
>> Lucene without any modifications, and hope to continue this way.
>>
>> Peter



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Antony Bowesman
I've got a similar duplicate case, but my duplicates are based on an external ID 
rather than the doc id, so they occur within a single Query.  It uses a custom 
HitCollector, but is score-based, not field-sorted.


If a duplicate has a higher score than the one already on the PQ, I need to update 
the stored score with the higher one, so the PQ needs a replace() method where the 
stored object.equals() can be used to find the object to delete.  I'm not sure 
if there's a way to find the object efficiently in this case other than a linear 
search.  I implemented remove().


Peter, how did you achieve 'last wins', as you must presumably remove the earlier 
entry from the PQ first?


Antony


Peter Keegan wrote:
The duplicate check would just be on the doc ID. I'm using TreeSet to 
detect

duplicates with no noticeable affect on performance. The PQ only has to be
checked for a previous value IFF the element about to be inserted is
actually inserted and not dropped because it's less than the least value
already in there. So, the TreeSet is never bigger than the size of the PQ
(typically 25 to a few hundred items), not the size of all hits.

Peter

On 3/29/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:


Hm, removing duplicates (as determined by a value of a specified document
field) from the results would be nice.
How would your addition affect performance, considering it has to check
the PQ for a previous value for every candidate hit?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Peter Keegan <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, March 29, 2007 9:39:13 AM
Subject: FieldSortedHitQueue enhancement

This is request for an enhancement to FieldSortedHitQueue/PriorityQueue
that
would prevent duplicate documents from being inserted, or alternatively,
allow the application to prevent this (reason explained below). I can do
this today by making the 'lessThan' method public and checking the queue
before inserting like this:

if (hq.size() < maxSize) {
   // doc will be inserted into queue - check for duplicate before
inserting
} else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc,
(ScoreDoc)hq.top()) {
  // doc will be inserted into queue - check for duplicate before
inserting
} else {
  // doc will not be inserted - no check needed
}

However, this is just replicating existing code in
PriorityQueue->insert().
An alternative would be to have a method like:

public boolean wouldBeInserted(ScoreDoc doc)
// returns true if doc would be inserted, without inserting

The reason for this is that I have some queries that get expanded into
multiple searches and the resulting hits are OR'd together. The queries
contain 'terms' that are not seen by Lucene but are handled by a
HitCollector that uses external data for each document to evaluate hits.
The
results from the priority queue should contain no duplicate documents
(first
or last doc wins).

Do any of these suggestions seem reasonable?. So far, I've been able to
use
Lucene without any modifications, and hope to continue this way.

Peter




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scores from HitCollector

2007-03-29 Thread Antony Bowesman

Erick Erickson wrote:

I wound up using a TopDocs instead, which has a getMaxScore that
I was able to use to normalize scores to between 0 and 1. In my case
I was collapsing the results into quintiles, so I threw them all
back into a FieldSortedHitQueue to get them sorted by secondary
criteria once the scores were all one of 5 discrete values


My HitCollector is a variant of TopDocCollector and I have max score.  I found 
where Hits does the normalisation in Hits.getMoreDocs().  It simply multiplies 
all scores by (1/maxScore).
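
So the same thing on collector results is just (a sketch, assuming you have the 
ScoreDoc array and max score from your collector):

float norm = (maxScore > 1.0f) ? 1.0f / maxScore : 1.0f;   // only scale down when maxScore > 1
for (int i = 0; i < scoreDocs.length; i++)
{
    scoreDocs[i].score *= norm;
}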


I was looking too deep down around the Scorer...

Can anyone say why this is useful and what's wrong with raw scores?

Thanks
Antony


On 3/29/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:


Hits will normalise scores >0<=1, but I'm using HitCollector and haven't
worked
out how to normalise those scores.

From what I can see, the scores are just multiplied by a factor to bring
the
top score down to 1.  Is this right or is there something more to it.

Do I need to normalise scores anyway - what's the reason it's done?
Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Scores from HitCollector

2007-03-29 Thread Antony Bowesman
Hits will normalise scores to >0 and <=1, but I'm using a HitCollector and haven't 
worked out how to normalise those scores.


From what I can see, the scores are just multiplied by a factor to bring the 
top score down to 1.  Is this right, or is there something more to it?


Do I need to normalise scores anyway - what's the reason it's done?
Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Start/end offsets in analyzers

2007-03-28 Thread Antony Bowesman
Thanks Erik.  For our purposes it seems more generally useful to use the 
original start/end offsets.

Antony


Erik Hatcher wrote:


They aren't used implicitly by anything in Lucene, but can be very handy 
for efficient highlighting.  Where you set the offsets really all 
depends on how you plan on using the offset values.  In the synonym 
example you mention, if the original word is "dog" and the user searched 
for "canine", to properly highlight the word "dog" in the original text 
the offsets for "canine" need to be where "dog" is.


Erik




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Start/end offsets in analyzers

2007-03-27 Thread Antony Bowesman
I'm fiddling with custom analyzers to analyze email addresses, storing the full 
email address and its component parts.  It's based on Solr's analyzer framework, 
so I have a StandardTokenizerFactory followed by an EmailFilterFactory.  It produces


Analyzing "<[EMAIL PROTECTED]>"

1: [EMAIL PROTECTED]:1->31:]
2: [humphrey:1->9:]
3: [bogart:10->16:]
4: [casablanca:17->27:]
5: [com:28->31:]

I set the start/end offsets to match each component, but listing 4.6 in the LIA 
book shows the start/end offsets for synonyms as the same as those of the 
original token, whereas I set my start/end to the actual offset and length of 
the part.
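
Roughly, the filter does something like this (a simplified sketch, not the real 
factory code) - it emits the whole address and queues the parts with their own offsets:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class EmailPartsFilter extends TokenFilter
{
    private final LinkedList pending = new LinkedList();

    public EmailPartsFilter(TokenStream input)
    {
        super(input);
    }

    public Token next() throws IOException
    {
        if (!pending.isEmpty())
            return (Token) pending.removeFirst();      // emit queued parts

        Token token = input.next();
        if (token == null || token.termText().indexOf('@') < 0)
            return token;                              // not an email address

        String text = token.termText();
        int base = token.startOffset();
        int start = 0;
        for (int i = 0; i <= text.length(); i++)
        {
            // split on '@' and '.', giving each part its own start/end offsets
            if (i == text.length() || text.charAt(i) == '@' || text.charAt(i) == '.')
            {
                if (i > start)
                    pending.add(new Token(text.substring(start, i), base + start, base + i));
                start = i + 1;
            }
        }
        return token;                                  // whole address comes out first
    }
}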


LIA says these are not used in Lucene - is that still the case for 2.1 and does 
this matter?


Thanks
Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index word files ( doc )

2007-03-26 Thread Antony Bowesman

Ryan Ackley wrote:

The 512 byte thing is a limitation of POIFS I think. I could be wrong
though. Have you tried opening the file with just POIFS?


It was some time ago, but it looks like I used both

org.apache.poi.hwpf.extractor.WordExtractor
org.apache.poi.hdf.extractor.WordDocument

with the same problem.
Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index word files ( doc )

2007-03-25 Thread Antony Bowesman

Ryan Ackley wrote:

Yes I do have plans for adding fast save support and support for more
file formats. The time frame for this happening is the next couple of
months.


That would be good when it comes.  It would be nice if it could handle a 'brute 
force' mode where, in the event of problems, it just extracts whatever text it 
can find.  Currently, if there is an Exception, I just run a raw strings parser 
on the file to fetch what I can.


One problem I found is that files not padded to 512-byte blocks cannot be 
parsed, but Word reads them happily.  They seem to be valid in other respects, 
i.e. they have the 1Table, Root Entry and other recognisable parts.  Padding the 
file to a 512-byte boundary with nulls makes it parse OK.


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index word files ( doc )

2007-03-25 Thread Antony Bowesman
I've been using Ryan's textmining in preference to POI, as internally TM uses 
POI plus the Word6 extractor, so it handles a greater variety of files.


Ryan, thanks for fixing your site.  Do you have any plans/ideas on how to parse 
the 'fast-saved' files and any ideas on Word files older than the Word 6 format?


Regards
Antony


Ryan Ackley wrote:

As the author of both Word POI and textmining.org, I recommend using
textmining.org. POI is for general purpose manipulation of Word
documents. textmining's only purpose is extracting text.

Also, people recommend using POI for text extraction but the only
place I've seen an actual how-to on this is in the "Lucene in Action"
book.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index word files ( doc )

2007-03-23 Thread Antony Bowesman
www.textmining.org, but the site is no longer accessible.  Check Nutch, which has 
a Word parser - it seems to be the original textmining.org Word6+POI parser.


Pre-word6 and "fast-saved" files will not work.  I've not found a solution for 
those
Antony


[EMAIL PROTECTED] wrote:

Thank you,
 
Are there other sollutions?




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Combining score from two or more hits

2007-03-23 Thread Antony Bowesman

Chris Hostetter wrote:


if you are using a HitCollector, there any re-evaluation is going to
happen in your code using whatever mechanism you want -- once your collect
method is called on a docid, Lucene is done with that docid and no longer
cares about it ... it's only whatever storage you may be maintaining of
high scoring docs thta needs to know that you've decided the score has
changed.

your big problem is going to be that you basically need to maintain a list
of *every* doc collected, if you don't know what the score of any of them
are until you've processed all the rest ... since docs are collected in
increasing order of docid, you might be able to make some optimizations
based on how big of a gap you've got between the doc you are currently
collecting and the last doc you've collected if you know that you're
always going to add docs that "relate" to eachother in sequential bundles
-- but this would be some very custom code depending on your use case.


I only ever need to return a couple of ID fields per doc hit, so I load them 
with FieldCache when I start a new searcher.  These IDs refer to unique objects 
elsewhere, but there can be one or more instances of the same Id in the index 
due to the way I've structured Documents.  A Document corresponds to an attachment 
of an object in the other system, and that object can have 1..n attachments.  My 
problem is that I need to return only unique external Ids, with some kind of 
combined score, up to the maxHits requested by the client.


Getting the unique Ids is no problem, but as you say I either have to store all 
hits and then sort them by score at the end once I know all unique docs, or do 
some clever stuff with some type of PriorityQueue that allows me to re-jig 
scores that already exist in the sorted queue.
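
A bare-bones sketch of the 'store everything, combine at the end' route, assuming 
the external Ids have already been pulled into an array via FieldCache (the class 
and field names are made up):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.search.HitCollector;

    /** Keeps the best score seen for each external Id. */
    public class UniqueIdCollector extends HitCollector
    {
        private final String[] externalIds;            // docId -> externalId, loaded via FieldCache
        private final Map bestScores = new HashMap();  // externalId -> Float (best score so far)

        public UniqueIdCollector(String[] externalIds)
        {
            this.externalIds = externalIds;
        }

        public void collect(int doc, float score)
        {
            String id = externalIds[doc];
            Float previous = (Float) bestScores.get(id);
            if (previous == null || score > previous.floatValue())
            {
                bestScores.put(id, new Float(score));   // keep the highest score per Id
            }
        }

        public Map getBestScores()
        {
            return bestScores;
        }
    }

Sorting the map entries by score at the end gives the unique Ids, but at the cost 
of holding every Id seen in memory until the search finishes.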


One idea your comments raise is the relationship of docids to the group of 
Documents added for the higher level object.  All the Documents for the external 
object are added with a single writer at index time.  Assuming that the Documents 
for a single external Id either all exist or none do, will the doc ids always 
remain sequential for that external Id, or will they 'reorganise' themselves?


Thanks
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Combining score from two or more hits

2007-03-22 Thread Antony Bowesman

Erick Erickson wrote:

Don't know if it's useful or not, but if you used  TopDocs instead,
you have access to an array of ScoreDoc which you could modify
freely. In my app, I used a FieldSortedHitQueue to re-sort things
when I needed to.


Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector 
variant of TopDocHitCollector.  The problem is not adjusting the score, it's 
what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2 
knowing that the original query resulted in hits on H1 AND H2.
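
Just to be clear, the mechanical side is easy enough -- a sketch of rescoring and 
re-sorting a TopDocs, where the uniform scale factor is only a stand-in for whatever 
adjustment makes sense:

    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class Rescorer
    {
        /** Scales every hit's score and re-sorts the hits by the adjusted score. */
        public static ScoreDoc[] rescore(TopDocs topDocs, float factor)
        {
            ScoreDoc[] docs = topDocs.scoreDocs;
            for (int i = 0; i < docs.length; i++)
            {
                docs[i].score *= factor;   // placeholder for the real adjustment
            }
            Arrays.sort(docs, new Comparator()
            {
                public int compare(Object a, Object b)
                {
                    float diff = ((ScoreDoc) b).score - ((ScoreDoc) a).score;
                    return diff > 0 ? 1 : (diff < 0 ? -1 : 0);
                }
            });
            return docs;
        }
    }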


Antony



Erick

On 3/22/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:


I have indexed objects that contain one or more attachments.  Each
attachment is
indexed as a separate Document along with the object metadata.

When I make a search, I may get hits in more than one Document that refer
to the
same object.  I have a HitCollector which knows if the object has already
been
found, so I want to be able to update the score of an existing hit in a
way that
makes sense.  e.g. If hit H1 has score 1.35 and hit H2 has score 2.9, is it
possible to re-score it on the basis that the real hit result is (H1 AND H2).

I can take the highest score of any Document, but just wondered if this is
possible during the HitCollector.collect method?

Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing rss feeds in multiple languages

2007-03-22 Thread Antony Bowesman

Melanie Langlois wrote:

Well, thanks, sounds like the best option to me. Does anybody use the
PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on
performance when using different analyzers.


I've not done any specific comparisons between using a single Analyzer and 
multiple Analyzers with PFAW, but our indexes typically have 20-25 fields, each of 
which can have a different analyzer depending on language or field type, 
although in practice about 8-10 fields may use the non-default analyzer.


Performance is pretty good in any case and there has not been any noticeable 
degradation when tweaking analyzers.
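
For reference, the wiring is nothing more than this sketch (the field names and 
analyzer choices are only examples; FrenchAnalyzer comes from the contrib analyzers):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;

    Analyzer defaultAnalyzer = new StandardAnalyzer();
    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(defaultAnalyzer);
    analyzer.addAnalyzer("subject_fr", new FrenchAnalyzer());   // language-specific field
    analyzer.addAnalyzer("messageId", new KeywordAnalyzer());   // id field, left untokenized
    // the same wrapper is then handed to both the IndexWriter and the QueryParser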

Antony





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Combining score from two or more hits

2007-03-21 Thread Antony Bowesman
I have indexed objects that contain one or more attachments.  Each attachment is 
indexed as a separate Document along with the object metadata.


When I make a search, I may get hits in more than one Document that refer to the 
same object.  I have a HitCollector which knows if the object has already been 
found, so I want to be able to update the score of an existing hit in a way that 
makes sense.  e.g. If hit H1 has score 1.35 and hit H2 has score 2.9, is it 
possible to re-score it on the basis that the real hit result is (H1 AND H2).


I can take the highest score of any Document, but just wondered if this is 
possible during the HitCollector.collect method?


Antony





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: question about getting all terms in a section of the documents

2007-03-20 Thread Antony Bowesman

Donna L Gresh wrote:


Also, the terms.close() statement is outside the scope of terms. I changed it
to the following; is this correct, and should the FAQ be changed?

try
{
    TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));

    while ("FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...
        String term = terms.term().text();
        System.out.println(term);
        if (!terms.next())
            break;
    }
    terms.close();
}


I assume the original reason for the finally block was to demonstrate that the 
TermEnum must be closed, so perhaps it should be


TermEnum terms = null;
try
{
    ...
}
finally
{
    if (terms != null)
        terms.close();
}

the same applies to TermDocs.  Maybe others?
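
Putting the two together, a complete version of the snippet would look something 
like this (the field name is a placeholder, and the null check guards against an 
empty enumeration):

    TermEnum terms = null;
    try
    {
        terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
        while (terms.term() != null && "FIELD-NAME-HERE".equals(terms.term().field()))
        {
            System.out.println(terms.term().text());
            if (!terms.next())
                break;
        }
    }
    finally
    {
        if (terms != null)
            terms.close();
    }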
Antony





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



IndexWriter.deleteDocuments(Term) vs IndexReader.deleteDocuments(Term)

2007-03-15 Thread Antony Bowesman
The writer method does not return the number of deleted documents.  Is there a 
technical reason why this is not done?


I am planning to see about converting my batch deletions using IndexReader to 
IndexWriter, but I'm currently using the return value to record stats.
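
For context, the IndexReader-based version I'm converting from is essentially this 
sketch (the stats call is just my own bookkeeping):

    IndexReader reader = IndexReader.open(directory);
    try
    {
        int deleted = reader.deleteDocuments(term);   // number of docs marked deleted
        stats.recordDeletions(deleted);
    }
    finally
    {
        reader.close();
    }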


Does the following give the same results?

  int beforeCount = writer.docCount();
  writer.deleteDocuments(term);
  int deleted = beforeCount - writer.docCount();

Given that I add and delete in batches, is there any benefit to switching to 
IndexWriter for deletions?


Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance between Filter and HitCollector?

2007-03-14 Thread Antony Bowesman
Thanks for the detailed response Hoss.  That's the sort of in-depth golden nugget 
I'd like to see in a copy of LIA 2 when it becomes available...


I've wanted to use Filter to cache certain of my Term Queries, as it looked 
faster for straight Term Query searches, but Solr's DocSet interface abstraction 
is more useful.  HashDocSet will probably satisfy 90% of my cache.


Index DBs will typically be in the 1-3 million documents range, but that is mail 
spread over 1-6K users, so caching lots of BitSets for that number of users is 
not practical!


I ended up creating a DocSetFilter and creating DocSets (a la Solr) from BitSet 
which is then cached.  I then convert it back during Filter.bits().  Not the 
best solution, but the typical hit size is small, so the iteration is fast.
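
The shape of it is roughly this -- here the cached form is just a small int[] of 
doc ids standing in for the Solr-style DocSet:

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    /** Filter backed by a small cached array of doc ids, expanded to a BitSet on demand. */
    public class DocSetFilter extends Filter
    {
        private final int[] docs;   // the cached 'DocSet' -- a small set of doc ids

        public DocSetFilter(int[] docs)
        {
            this.docs = docs;
        }

        public BitSet bits(IndexReader reader) throws IOException
        {
            BitSet bits = new BitSet(reader.maxDoc());
            for (int i = 0; i < docs.length; i++)
            {
                bits.set(docs[i]);
            }
            return bits;
        }
    }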


Thanks eks dev for the info about Lucene-584 - that looks like an interesting 
set of patches.


Antony

Chris Hostetter wrote:

it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which could be anything), the
other is doing a simple TermEnumeration.

Assuming that Query is a TermQuery, the Filter is theoretically going to be
faster because it doesn't have to compute any Scores ... generally speaking
a Filter will always be a little faster than a functionally equivalent
Query for the purposes of building up a simple BitSet of matching
documents, because the Query involves the score calculations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet, it can deal with the
results in whatever way it wants (whereas a Filter always generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) allows us to use the DocSet abstraction which allows other
     performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
     (DocLists know about scores, sorting, and pagination) in a single
     pass when scores or sorting are requested.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Antony Bowesman

Chris Hostetter wrote:

the only real reason you should really need 2 searchers at a time is if
you are searching other queries in parallel threads at the same time ...
or if you are warming up one new searcher that's "ondeck" while still
serving queries with an older searcher.


Hoss, I hope I misunderstood this: are you saying that the same 
IndexSearcher/IndexReader pair can not be used concurrently against a single 
index by different threads executing different queries?


The archives have several mentions of sharing IndexSearcher among threads and 
Otis says http://www.jguru.com/faq/view.jsp?EID=492393.


Can you clarify what you meant please.

Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Urgent] deleteDocuments fails after merging ...

2007-03-13 Thread Antony Bowesman

Erick Erickson wrote:

The javadocs point out that this line

int nb = mIndexReaderClone.deleteDocuments(urlTerm)

removes *all* documents for a given term. So of course you'll fail
to delete any documents the second time you call
deleteDocuments with the same term.


Isn't the code snippet below doing a search before attempting the deletion, so 
that from the IndexReader's point of view (as used by the IndexSearcher) the item 
exists?  What is mIndexReaderClone?  Is it the same reader that is used by the 
IndexSearcher?


I'm not sure, but if you search with one IndexReader and delete the document 
using another IndexReader and then repeat the process, I think that the search 
would still result in a hit, but the deletion would return 0.
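
To illustrate, a bare sketch (assuming both readers are opened over the same 
Directory, and that mIndexReaderClone is the second one):

    IndexReader searchReader = IndexReader.open(directory);   // the reader behind the IndexSearcher
    IndexReader deleteReader = IndexReader.open(directory);   // the reader used for deletes

    int nb = deleteReader.deleteDocuments(urlTerm);   // > 0 the first time round
    deleteReader.close();                             // flushes the deletes to the index

    // searchReader still sees its original snapshot, so the same search still hits,
    // but a freshly opened delete reader now finds nothing and returns 0.  The
    // searching reader has to be reopened before it reflects the deletions:
    searchReader.close();
    searchReader = IndexReader.open(directory);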



On 3/13/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote:


Before I delete a document I search for it in the index to be sure there is a
hit (via a Term object).
When I find a hit I delete the document (with the same Term object):



Hits hits = search(query);
if (hits.length() > 0) {
    if (hits.length() > 1) {
        System.out.println("found in the index with duplicates");
    }
    System.out.println("found in the index");
    try {
        int nb = mIndexReaderClone.deleteDocuments(urlTerm);
        if (nb > 0)
            System.out.println("successfully deleted");
        else
            throw new IOException("0 doc deleted");
    } catch (IOException e) {
        e.printStackTrace();
        throw new Exception(
            Thread.currentThread().getName() + " --- Deleting

Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Wildcard searches with * or ? as the first character

2007-03-13 Thread Antony Bowesman
I have read that with Lucene it is not possible to do wildcard searches 
with * or ? as the first character. Wildcard searches with * as the 


Lucene supports it.  If you are using QueryParser to parse your queries, see

http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html#setAllowLeadingWildcard(boolean)
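
A minimal example, with the field name and analyzer as placeholders (bear in mind 
that leading wildcards can be slow, because they force enumeration over a large 
part of the term dictionary):

    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    parser.setAllowLeadingWildcard(true);
    Query query = parser.parse("*ildcard");   // rejected by default without the setter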

Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Performance between Filter and HitCollector?

2007-03-12 Thread Antony Bowesman

There are (at least) two ways to generate a BitSet which can be used for 
filtering.

Filter.bits()

  BitSet bits = new BitSet(reader.maxDoc());
  TermDocs td = reader.termDocs(new Term("field", "text"));
  while (td.next())
  {
      bits.set(td.doc());
  }
  return bits;

and HitCollector.collect(), as suggested in Javadocs

   final BitSet bits = new BitSet(indexReader.maxDoc());
   searcher.search(query, new HitCollector() {
       public void collect(int doc, float score) {
           bits.set(doc);
       }
   });

Solr seems to use DocSetHitCollector in places, which allows the DocSet interface 
to be used rather than a plain old BitSet, so that small sets can be optimised.  
Does anyone know the performance implications of using a HitCollector, when the 
score is not required, versus using a Filter and then generating a DocSet?


Antony






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman

Chris Hostetter wrote:

: equals to get q1.equals(q2).  The core Lucene Query implementations do 
override
: equals() to satisfy that test, but some of the contrib Query implementations 
do
: not override equals, so you would never see the same Query twice and caching
: BitSets for those Query instances would be a waste of time.

filing bugs about those Query instances would be helpful .. bugs with
patches that demonstrate the problem in unit tests and fix them would be
even more helpful :)


OK, I'll put it on my todo list, but I've got to get the product out of the door 
this month...



These classes may prove useful in submitting test cases...

http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/QueryUtils.java?view=log
http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/CheckHits.java


Thanks for those pointers.
Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing & search?

2007-03-06 Thread Antony Bowesman

Hi,


   I've indexed 4 of 5 fields with Field.Store.YES & Field.Index.NO, and indexed
the remaining one, whose field name is *content*, with Field.Store.YES &
Field.Index.TOKENIZED (its value is the combined value of the other 4 fields plus
some more values).  So my search is always based on the *content* field.
   I've indexed 2 documents.  In the 1st doc: f1:mybook, f2:contains, f3:all,
f4:information, content:mybook contains all information that you need; and in
the 2nd: f1:somebody, f2:want, f3:search, f4:information, content:somebody want
search information of mybook.
   I want search results for all docs where field1's value is "mybook".  My query
is content:mybook, but it returns 2 matching documents instead of 1.


The example shows the first 4 words of each 'content' being stored as f1, f2, 
f3, f4.  If that is your intention, then you can use SpanFirstQuery to find 
words that were in f1.  It can also be used to find hits in words 2-4, but you 
will have to test the hits to find out the positional match.
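
For the f1 case it would be something like this sketch, using your field names:

    // match "mybook" only when it appears as the first token of 'content'
    SpanFirstQuery query = new SpanFirstQuery(
            new SpanTermQuery(new Term("content", "mybook")), 1);
    Hits hits = searcher.search(query);

Using a larger end value (e.g. 4) widens it to the first four words, but then you 
have to inspect the spans to know which position actually matched.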


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman

Chris Hostetter wrote:

: I was hoping that Query.equals() would be defined so that equality would be
: based on the results that Query generates for a given reader.

if query1.equals(query2) then the results of query1 on an
indexreader should be identical to the results of query2 on the same
indexreader 


Thanks Hoss and Erik.  This is the case I wanted, but re-reading my desire 
above, I see it looks more like the inverse.  Sorry for the confusion.



... but the inverse cannot be guaranteed: if query1 and
query2 generate identical results when queried against an indexreader, that
says absolutely nothing about whether query1.equals(query2).


Yes, that's not what I was after - As you say, it's not possible to implement.


in general, what you describe really isn't needed for caching query result
sets ... what matters is that if you've already seen the query before
(which you can tell using q1.equals(q2)) then you don't need to execute it


Exactly, and to be sure of that you have to be able to rely on an overridden 
equals to get q1.equals(q2).  The core Lucene Query implementations do override 
equals() to satisfy that test, but some of the contrib Query implementations do 
not override equals, so you would never see the same Query twice and caching 
BitSets for those Query instances would be a waste of time.
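
That is all the cache relies on -- something as simple as this sketch falls apart 
when equals()/hashCode() are not overridden, because every lookup misses:

    private final Map cachedFilters = new HashMap();   // Query -> BitSet

    public synchronized BitSet getBits(Query query, IndexReader reader) throws IOException
    {
        BitSet bits = (BitSet) cachedFilters.get(query);   // relies on Query.equals()/hashCode()
        if (bits == null)
        {
            final BitSet collected = new BitSet(reader.maxDoc());
            new IndexSearcher(reader).search(query, new HitCollector()
            {
                public void collect(int doc, float score)
                {
                    collected.set(doc);
                }
            });
            bits = collected;
            cachedFilters.put(query, bits);
        }
        return bits;
    }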


Antony





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


