What doc id to use on IndexReader with SetNextReader

2011-04-18 Thread Antony Bowesman
Migrating some code from 2.3.2 to 2.9.4 and I have custom Collectors. Now there are multiple calls to collect and each call needs to adjust the passed doc id by docBase as given in SetNextReader. However, if you want to fetch the document in the collector, what docId/IndexReader combination
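Uwe's answer in the thread below is to call document() on the per-segment reader passed to setNextReader. A minimal 2.9-style sketch (class and field names are illustrative only):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class FetchingCollector extends Collector {
        private IndexReader reader; // current segment reader
        private int docBase;        // this segment's offset in the top-level reader

        public void setNextReader(IndexReader reader, int docBase) {
            this.reader = reader;
            this.docBase = docBase;
        }

        public void collect(int doc) throws IOException {
            // per-segment doc id with the per-segment reader
            Document d = reader.document(doc);
            // docBase + doc only if a top-level id is needed elsewhere
            int topLevelId = docBase + doc;
        }

        public void setScorer(Scorer scorer) {}

        public boolean acceptsDocsOutOfOrder() { return true; }
    }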

Re: What doc id to use on IndexReader with SetNextReader

2011-04-18 Thread Antony Bowesman
Thanks Uwe, I assumed as much. On 18/04/2011 7:28 PM, Uwe Schindler wrote: Document d = reader.document(doc) This is the correct way to do it. Uwe

Index time boost question

2011-04-14 Thread Antony Bowesman
I have a test case written for 2.3.2 that tested an index time boost on a field of 0.0F and then did a search using Hits and got 0 results. I'm now in the process of upgrading to 2.9.4 and am removing all use of Hits in my test cases and using a Collector instead. Now the test case fails as

NullPointerException in FieldSortedHitQueue

2011-04-14 Thread Antony Bowesman
Upgrading from 2.3.2 to 2.9.4 I get NPE as below Caused by: java.lang.NullPointerException at org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:224) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224) at

DocIdSet to represent small number of hits in large Document set

2011-04-05 Thread Antony Bowesman
I'm converting a Lucene 2.3.2 to 2.4.1 (with a view to going to 2.9.4). Many of our indexes are 5M+ Documents, however, only a small subset of these are relevant to any user. As a DocIdSet, backed by a BitSet or OpenBitSet, is rather inefficient in terms of memory use, what is the recommended
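One possibility, sketched against the 2.9 DocIdSetIterator API (docID()/nextDoc()/advance()) and untested: back the set with a sorted int[] of the matching ids, so memory is proportional to the hit count rather than maxDoc(). Lucene's SortedVIntList aims at the same trade-off and may also be worth a look.

    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;

    public class SortedIntDocIdSet extends DocIdSet {
        private final int[] docs; // matching doc ids, ascending

        public SortedIntDocIdSet(int[] sortedDocs) { this.docs = sortedDocs; }

        public DocIdSetIterator iterator() {
            return new DocIdSetIterator() {
                private int i = -1;

                public int docID() {
                    if (i < 0) return -1;
                    return i < docs.length ? docs[i] : NO_MORE_DOCS;
                }

                public int nextDoc() {
                    return ++i < docs.length ? docs[i] : NO_MORE_DOCS;
                }

                public int advance(int target) {
                    // linear scan; a binary search would do better for large jumps
                    while (++i < docs.length) {
                        if (docs[i] >= target) return docs[i];
                    }
                    return NO_MORE_DOCS;
                }
            };
        }
    }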

TopFieldDocCollector and v3.0.0

2009-12-07 Thread Antony Bowesman
I'm on 2.3.2 and looking to move to 2.9.1 or 3.0.0 In 2.9.1 TopFieldDocCollector is Deprecated. Please use TopFieldCollector instead. in 3.0.0 TopFieldCollector says NOTE: This API is experimental and might change in incompatible ways in the next release What is the suggested path for

NumberFormatException when creating field cache

2009-09-09 Thread Antony Bowesman
I'm using Lucene 2.3.2 and have a date field used for sorting, which is MMDDHHMM. I get an exception when the FieldCache is being generated as follows: java.lang.NumberFormatException: For input string: 190400-412317

Re: TermEnum with deleted documents

2009-05-10 Thread Antony Bowesman
, Antony Bowesman a...@teamware.com wrote: I am merging Index A to Index B. First I read the terms for a particular field from index A and some of the documents in A get deleted. I then enumerate the terms on a different field also in index A, but the terms from the deleted document are still

TermEnum with deleted documents

2009-05-06 Thread Antony Bowesman
I am merging Index A to Index B. First I read the terms for a particular field from index A and some of the documents in A get deleted. I then enumerate the terms on a different field also in index A, but the terms from the deleted document are still present. The termEnum.docFreq() also

Which is more efficient

2009-05-05 Thread Antony Bowesman
Just wondered which was more efficient under the hood. This: for (int i = 0; i < size; i++) terms[i] = new Term(id, doc_key[i]); writer.deleteDocuments(terms); for (int i = 0; i < size; i++) writer.addDocument(doc[i]); Or this: for (int i = 0; i < size; i++)
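For comparison, IndexWriter.updateDocument(Term, Document) folds the delete and the add into one call per document; a hedged sketch assuming id is the unique key field:

    for (int i = 0; i < size; i++) {
        writer.updateDocument(new Term("id", doc_key[i]), doc[i]);
    }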

Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
Michael McCandless wrote: Lucene doesn't provide any way to do this, except opening a reader. Opening a reader is not that expensive if you use it for this purpose. EG neither norms nor FieldCache will be loaded if you just enumerate the term docs. Thanks for that info. These indexes will

Re: How to not overwrite a Document if it 'already exists'?

2009-05-05 Thread Antony Bowesman
Thanks for that info. These indexes will be large, in the 10s of millions. id field is unique and is 29 bytes. I guess that's still a lot of data to trawl through to get to the term. Have you tested how long it takes to look up docs from your id? Not in indexes that size in a live

Re: Lucene 2.4 - Searching

2009-01-27 Thread Antony Bowesman
Karl Heinz Marbaise wrote: I have a field which is called filename and contains a filename which can of course be lowercase or uppercase or a mixture... I would like to do the following: +filename:/*scm*.doc That should result in getting things like /...SCMtest.doc /...scmtest.doc

Re: addIndexesNoOptimize question

2008-12-19 Thread Antony Bowesman
Thanks Mike, I'm still on 2.3.1, so will upgrade soon. Antony Michael McCandless wrote: This was an attempt on addIndexesNoOptimize's part to respect the maxMergeDocs (which prevents large segments from being merged) you had set on IndexWriter. However, the check was too pedantic, and was

addIndexesNoOptimize question

2008-12-17 Thread Antony Bowesman
The javadocs state This requires ... and the upper bound of those segment doc counts not exceed maxMergeDocs. Can one of the gurus please explain what that means and what needs to be done to find out whether an index being merged fits that criterion. Thanks Antony

Re: Which is faster/better

2008-11-25 Thread Antony Bowesman
Michael McCandless wrote: If you have nothing open already, and all you want to do is delete certain documents and make a commit point, then using IndexReader vs IndexWriter should show very little difference in speed. Thanks. This use case can assume there may be nothing open. I prefer

Which is faster/better

2008-11-24 Thread Antony Bowesman
In 2.4, as well as IndexWriter.deleteDocuments(Term) there is also IndexReader.deleteDocuments(Term). I understand opening a reader is expensive, so does this means using IndexWriter.deleteDocuments would be faster from a closed index position? As the IndexReader instance is newer, it has

Re: distinct field values

2008-10-14 Thread Antony Bowesman
Akanksha Baid wrote: I have indexed multiple documents - each of them have 3 fields ( id, tag , text). Is there an easy way to determine the set of tags for a given query without iterating through all the hits? For example if I have 100 documents in my index and my set of tag = {A, B, C}.

Re: Phrase Query

2008-09-16 Thread Antony Bowesman
Is it possible to write a document with different analyzers in different fields? PerFieldAnalyzerWrapper
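A minimal sketch of that one-word answer (the field name is illustrative):

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // StandardAnalyzer for every field except "filename"
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("filename", new KeywordAnalyzer());
    // hand 'analyzer' to the IndexWriter (and QueryParser) as usual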

Caching Filters and docIds when using MultiSearcher/IndexSearcher(MultiReader)...

2008-09-11 Thread Antony Bowesman
Up to now I have only needed to search a single index, but now I will have many index shards to search across. My existing search maintained cached filters for the index as well as a cache of my own unique ID fields in the index, keyed by Lucene DocId. Now I need to search multiple indices, I

Re: Merging indexes - which is best option?

2008-09-09 Thread Antony Bowesman
Thanks Karsten, I decided first to delete all duplicates from master(iW) and then to insert all temporary indices(other). I reached the same conclusion. As your code shows, it's a simple enough solution. You had a good point with the iW.abort() in the rollback case. Antony

Javadoc wording in IndexWriter.addIndexesNoOptimize()

2008-09-04 Thread Antony Bowesman
The Javadoc for this method has the following comment: This requires this index not be among those to be added, and the upper bound of those segment doc counts not exceed maxMergeDocs. What does the second part of that mean, which is especially confusing given that MAX_MERGE_DOCS is

Merging indexes - which is best option?

2008-09-04 Thread Antony Bowesman
I am creating several temporary batches of indexes to separate indices and periodically will merge those batches to a set of master indices. I'm using IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the master may already contain the index for that document and I get a

Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman
I have a custom TopDocsCollector and need to collect a payload from each final document hit. The payload comes from a single term in each hit. When collecting the payload, I don't want to fetch the payload during the collect() method as it will make fetches which may subsequently be bumped
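A hedged sketch of the deferred fetch: walk the final hits in ascending doc id order so one TermPositions can be reused, since skipTo() only moves forward (field and term names are hypothetical):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    TermPositions tp = reader.termPositions(new Term("meta", "payloadTerm"));
    for (int j = 0; j < hitDocs.length; j++) {        // hitDocs sorted ascending
        if (tp.skipTo(hitDocs[j]) && tp.doc() == hitDocs[j]) {
            tp.nextPosition();
            if (tp.isPayloadAvailable()) {
                byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                // ... decode the payload for this hit ...
            }
        }
    }
    tp.close();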

Re: Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman
Michael McCandless wrote: TermDocs.skipTo() only moves forwards. Can you use a stored field to retrieve this information, or do you really need to store it per-term-occurrence in your docs? I discussed my use case with Doron earlier and there were two options, either to use payloads or

Re: Can TermDocs.skipTo() go backwards

2008-08-27 Thread Antony Bowesman
Michael McCandless wrote: Ahh right, my short term memory failed me ;) I now remember this thread. Excused :) I expect you have real work to occupy your mind! Yes, though LUCENE-1231 (column stride stored fields) should help this. I see from JIRA that MB has started working on this -

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-18 Thread Antony Bowesman
Doron Cohen wrote: The API definitely doesn't promise this. AFAIK implementation wise it happens to be like this but I can be wrong and plus it might change in the future. It would make me nervous to rely on this. I made some tests and it 'seems' to work, but I agree, it also makes me nervous

Re: Multiple index performance

2008-08-18 Thread Antony Bowesman
Cyndy wrote: I want to keep user text files indexed separately, I will have about 10,000 users and each user may have about 20,000 short files, and I need to keep privacy. So the idea is to have one folder with the text files and index for each user, so when search will be done, it will be

Re: Multiple index performance

2008-08-18 Thread Antony Bowesman
[EMAIL PROTECTED] wrote: Thanks Anthony for your response, I did not know about that field. You make your own fields in Lucene, it is not something Lucene gives you. But still I have a problem and it is about privacy. The users are concerned about privacy and so, we thought we could have

Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-17 Thread Antony Bowesman
I assume you already know this but just to make sure what I meant was clear - no tokenization but still indexing just means that the entire field's text becomes a single unchanged token. I believe this is exactly what SingleTokenTokenStream can buy you - a single token, for which you can pre-set

Re: Payloads and tokenizers

2008-08-14 Thread Antony Bowesman
Thanks for your comments Doron. I found the earlier discussions on the dev list (21/12/06), where this issue is discussed - my use case is similar to Nadav Har'El's. Implementing payloads via Tokens explicitly prevents the use of payloads for untokenized fields, as they only support

Payloads and tokenizers

2008-08-13 Thread Antony Bowesman
I started playing with payloads and have been trying to work out how to get the data into the payload I have a field where I want to add the following untokenized fields A1 A2 A3 With these fields, I would like to add the payloads B1 B2 B3 Firstly, it looks like you cannot add payloads to
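One hedged route in the 2.3-era token API: index the values through a TokenFilter that attaches the payload to each token as it streams by (class and names illustrative):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    public class PayloadTaggingFilter extends TokenFilter {
        private final byte[] bytes; // e.g. B1 for the value A1

        public PayloadTaggingFilter(TokenStream in, byte[] bytes) {
            super(in);
            this.bytes = bytes;
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t != null) {
                t.setPayload(new Payload(bytes)); // attach payload to this token
            }
            return t;
        }
    }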

Re: Per user data store

2008-08-05 Thread Antony Bowesman
Ganesh - yahoo wrote: Hello all, Documents coressponding to multiple users are to be indexed. Each user is going to search only his documents. Only Administrator could search all users data. Is it good to have one database for each User or to have only one database for all Users? Which will be

Modifying a document by updating a payloads?

2008-07-30 Thread Antony Bowesman
I seem to recall some discussion about updating a payload, but I can't find it. I was wondering if it were possible to use a payload to implement 'modify' of a Lucene document. For example, I have an ID field, which has a unique ID referring to an external DB. For example, I would like to

Re: Modifying a document by updating a payloads?

2008-07-30 Thread Antony Bowesman
Hi Mike, Unfortunately you will have to delete the old doc, then reindex a new doc, in order to change any payloads in the document's Tokens. This issue: https://issues.apache.org/jira/browse/LUCENE-1231 which is still in progress, could make updating stored (but not indexed) fields a

Rebuilding parallel indexes

2008-06-09 Thread Antony Bowesman
I have a design where I will be using multiple index shards to hold approx 7.5 million documents per index per month over many years. These will be large static R/O indexes but the corresponding smaller parallel index will get many frequent changes. I understand from previous replies by Hoss

Re: Rebuilding parallel indexes

2008-06-09 Thread Antony Bowesman
Andrzej Bialecki wrote: I have a thought ;) Perhaps you could use a FilteredIndexReader to maintain a map between new IDs and old IDs, and remap on the fly. Although I think that some parts of Lucene depend on the fact that in a normal index the IDs are monotonically increasing ... this would

Re: Binding lucene instance/threads to a particular processor(or core)

2008-04-21 Thread Antony Bowesman
That paper from 1997 is pretty old, but mirrors our experiences in those days. Then, we used Solaris processor sets to really improve performance by binding one of our processes to a particular CPU while leaving the other CPUs to manage the thread intensive work. You can bind processes/LWPs

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman
Thanks all for the suggestions - there was also another thread Lucene index on relational data which had crossover here. That's an interesting idea about using ParallelReader for the changeable index. I had thought to just have a triplet indexed 'owner:mailId:label' in each Doc and have

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman
Chris Hostetter wrote: you can't ... that's why i said you'd need to rebuild the smaller index completely on a periodic basis (going in the same order as the docs in the Mmm, the annotations would only be stored in the index. It would be possible to store them elsewhere, so I can

Using Lucene partly as DB and 'joining' search results.

2008-04-11 Thread Antony Bowesman
We're planning to archive email over many years and have been looking at using DB to store mail meta data and Lucene for the indexed mail data, or just Lucene on its own with email data and structure stored as XML and the raw message stored in the file system. For some customers, the volumes

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-11 Thread Antony Bowesman
Paul Elschot wrote: Op Friday 11 April 2008 13:49:59 schreef Mathieu Lecarme: Use Filter and BitSet. From the personal data, you build a Filter (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html) which is used in the main index. With 1 billion mails, and

Re: How to improve performance of large numbers of successive searches?

2008-04-10 Thread Antony Bowesman
Chris McGee wrote: These tips have significantly improved the time to build the directory and search it. However, I have noticed that when I perform term queries using a searcher many times in rapid succession and iterate over all of the hits it can take a significant time. To perform 1000

Re: Search emails - parsing mailbox (mbox) files

2008-04-04 Thread Antony Bowesman
Subodh Damle wrote: Is there any reliable implementation for parsing email mailbox files (mbox format), especially large (50MB) archives? Even after searching lucene mailing list archives, googling around, I couldn't find one. I took a look at Apache James project which seems to offer some

Re: Biggest index

2008-03-16 Thread Antony Bowesman
[EMAIL PROTECTED] wrote: Yes of course, the answers to your questions are important too. But no answer at all until now :( One example: 1.5 million documents Approx 15 fields per document DB is 10-15GB (can't find correct figure) All on one machine. No stats on search usage though. We're

Re: Using RangeFilter

2008-01-24 Thread Antony Bowesman
vivek sar wrote: I've a field as NO_NORM, does it have to be untokenized to be able to sort on it? NO_NORMS is the same as UNTOKENIZED + omitNorms, so you can sort on that. Antony

Re: Multiple searchers (Was: CachingWrapperFilter: why cache per IndexReader?)

2008-01-23 Thread Antony Bowesman
Toke Eskildsen wrote: == Average over the first 50.000 queries == metis_flash_RAID0_8GB_i37_t2_l21.log - 279.6 q/sec metis_flash_RAID0_8GB_i37_t2_l23.log - 202.3 q/sec metis_flash_RAID0_8GB_i37_v23_t2_l23.log - 195.9 q/sec == Average over the first 340.000 queries ==

DateTools UTC/GMT mismatch

2008-01-22 Thread Antony Bowesman
Hi, I just noticed that although the Javadocs for Lucene 2.2 state that the dates for DateTools use UTC as a timezone, they are actually using GMT. Should either the Javadocs be corrected or the code corrected to use UTC instead. Antony

Re: Using RangeFilter

2008-01-21 Thread Antony Bowesman
vivek sar wrote: I need to be able to sort on optime as well, thus need to store it. Lucene's default sorting does not need the field to be stored, only indexed as untokenized. Antony
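In code, roughly (a hedged sketch; the optime field only needs to be indexed UN_TOKENIZED):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Sort;

    // sorting reads the indexed terms via FieldCache; Field.Store is irrelevant
    Hits hits = searcher.search(query, new Sort("optime"));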

Re: Lucene sorting case-sensitive by default?

2008-01-15 Thread Antony Bowesman
Erick Erickson wrote: doc.add(new Field(f, "This is Some Mixed, case Junk($*% With Ugly SYmbols", Field.Store.YES, Field.Index.TOKENIZED)); <snip>

Re: how do I get my own TopDocHitCollector?

2008-01-10 Thread Antony Bowesman
the searcher and places them in the cache? -Original Message- From: Antony Bowesman [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 09, 2008 7:19 PM To: java-user@lucene.apache.org Subject: Re: how do I get my own TopDocHitCollector? Beard, Brian wrote: Question: The documents

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Antony Bowesman
Ariel wrote: The problem I have is that my application spends a lot of time to index all the documents, the delay to index 10 gb of pdf documents is about 2 days (to convert pdf to text I am using pdfbox) that is of course a lot of time, other applications based in lucene, for instance ibm

Re: how do I get my own TopDocHitCollector?

2008-01-09 Thread Antony Bowesman
Beard, Brian wrote: Question: The documents that I index have two id's - a unique document id and a record_id that can link multiple documents together that belong to a common record. I'd like to use something like TopDocs to return the first 1024 results that have unique record_id's, but I

Re: Deleting a single TermPosition doc, frequency, position for a Document

2008-01-08 Thread Antony Bowesman
Antony Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Antony Bowesman [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, January 8, 2008 12:47:05 AM Subject: Deleting a single TermPosition doc, frequency, position

Deleting a single TermPosition doc, frequency, position for a Document

2008-01-07 Thread Antony Bowesman
I'd like to 'update' a single Document in a Lucene index. In practice, this 'update' is actually just a removal of a single TermPosition for a given Term for a given doc Id. I don't think this is currently possible, but would it be easy to change Lucene to support this type of usage? The

Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
My application batch adds documents to the index using IndexWriter.addDocument. Another thread handles searchers, creating new ones as needed, based on a policy. These searchers open a new IndexReader and there is currently no synchronisation between this action and any being performed by my

Re: Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
Using Lucene 2.1 Antony Bowesman wrote: My application batch adds documents to the index using IndexWriter.addDocument. Another thread handles searchers, creating new ones as needed, based on a policy. These searchers open a new IndexReader and there is currently no synchronisation between

Re: Concurrency between IndexReader and IndexWriter

2007-12-09 Thread Antony Bowesman
Looks like I got myself into a twist for nothing - the reader will see a consistent view, despite what the writer does, as long as the reader remains open. Apologies for the noise... Antony

Re: deleteDocuments by Term[] for ALL terms

2007-12-04 Thread Antony Bowesman
delCount = 0; while(scorer.next()) { reader.deleteDocument(scorer.doc()); delCount++; } that iterates over all the docIDs without scoring them and without building up a Hit for each, etc. Mike Antony Bowesman [EMAIL PROTECTED] wrote: Hi, I'm using IndexReader.deleteDocuments(Term
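Expanded into a hedged, self-contained sketch of that suggestion (field names and directory are hypothetical; 2.x Scorer API):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.search.TermQuery;

    // delete only the docs matching ALL terms, counting as we go
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("owner", "antony")), BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("folder", "inbox")), BooleanClause.Occur.MUST);

    IndexReader reader = IndexReader.open(directory);
    Scorer scorer = query.weight(new IndexSearcher(reader)).scorer(reader);
    int delCount = 0;
    while (scorer.next()) {
        reader.deleteDocument(scorer.doc());
        delCount++;
    }
    reader.close();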

deleteDocuments by Term[] for ALL terms

2007-11-25 Thread Antony Bowesman
Hi, I'm using IndexReader.deleteDocuments(Term) to delete documents in batches. I need the deleted count, so I cannot use IndexWriter.deleteDocuments(). What I want to do is delete documents based on more than one term, but not like IndexWriter.deleteDocuments(Term[]) which deletes all

Re: efficient way to filter out unwanted results

2007-06-15 Thread Antony Bowesman
yu wrote: Thanks Sawan for the suggestion. I guess this will work for statically known doc ids. In my case, I know only external ids that I want to exclude from the result set for each search. Of course, I can always exclude these docs in a post search process. I am curious if there are

Re: Contains query parsed to PrefixQuery

2007-06-11 Thread Antony Bowesman
It's a bug in 2.1, fixed by Doron Cohen http://issues.apache.org/jira/browse/LUCENE-813 Antony dontspamterry wrote: Hi all, I was experimenting with queries using wildcard on an untokenized field and noticed that a query with both a starting and trailing wildcard, e.g. *abc*, gets parsed to

Re: How can I search over all documents NOT in a certain subset?

2007-06-08 Thread Antony Bowesman
Hilton Campbell wrote: Yes, that's actually come up. The document ids are indeed changing which is causing problems. I'm still trying to work it out myself, but any help would most definitely be appreciated. If you have an application Id per document, then you could cache that field for

Re: How can I search over all documents NOT in a certain subset?

2007-06-06 Thread Antony Bowesman
Steven Rowe wrote: Conceptually (caveat: untested), you could: 1. Extend Filter[1] (call it DejaVuFilter) to hold a BitSet per IndexReader. The BitSet would hold one bit per doc[2], each initialized to true. 2. Unset a DejaVuFilter instance's bit for each of your top N docs by walking the
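A minimal single-reader sketch of steps 1-2 (untested, as the post itself says):

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    public class DejaVuFilter extends Filter {
        private final BitSet seen;

        public DejaVuFilter(int maxDoc) {
            seen = new BitSet(maxDoc);
            seen.set(0, maxDoc);      // every doc allowed initially
        }

        public void markSeen(int docId) {
            seen.clear(docId);        // filter this doc out of later searches
        }

        public BitSet bits(IndexReader reader) {
            return seen;
        }
    }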

Re: Does Lucene search over memory too?

2007-05-29 Thread Antony Bowesman
Doron Cohen wrote: Antony Bowesman [EMAIL PROTECTED] wrote on 28/05/2007 22:48:41: I read the new IndexWriter Javadoc and I'm unclear about this autocommit. In 2.1, I thought an IndexReader opened in an IndexSearcher does not see additions to an index made by an IndexWriter, i.e. maxDoc

Re: Does Lucene search over memory too?

2007-05-28 Thread Antony Bowesman
Michael McCandless wrote: The autoCommit mode for IndexWriter has not actually been released yet: you can only use it on the trunk. It actually serves a different purpose: it allows you to make sure your searchers do not see any changes made by the writer (even the ones that have been flushed)

Re: Memory leak (JVM 1.6 only)

2007-05-17 Thread Antony Bowesman
Daniel Noll wrote: On Tuesday 15 May 2007 21:59:31 Narednra Singh Panwar wrote: try using -Xmx option with your Application. and specify maximum/ minimum memory for your Application. It's funny how a lot of people instantly suggest this. What if it isn't possible? There was a situation a

Re: Turning PrefixQuery into a TermQuery

2007-04-11 Thread Antony Bowesman
Steffen Heinrich wrote: Normally an IndexWriter uses only one default Analyzer for all its tokenizing businesses. And while it is apparently possible to supply a certain other instance when adding a specific document there seems to be no way to use different analyzers on different fields

Re: Not able to search on UN_TOKENIZED fields

2007-04-09 Thread Antony Bowesman
You can either use KeywordAnalyzer with your QueryParser which will correctly handle UN_TOKENIZED fields, but that will use KeywordAnalyzer for all fields. To use a field specific Analyzer you either need to use PerFieldAnalyzerWrapper and preload it with all possible fields and use that as

Re: Benchmarking LUCENE-584 with contrib/benchmark

2007-04-03 Thread Antony Bowesman
Otis Gospodnetic wrote: Here is one more related question. It looks like the o.a.l.benchmark.Driver class is supposed to be a generic driver class that uses the Benchmarker configured in one of those conf/*.xml files. However, I see StandardBenchmarker.class hard-coded there:

Re: Help - FileNotFoundException during IndexWriter.init()

2007-04-01 Thread Antony Bowesman
Michael McCandless wrote: Yes, I've disabled it currently while the new test runs. Let's see. I'll re-run the test a few more times and see if I can re-create the problem. OK let's see if that makes it go away! Hopefully :) I ran the tests several times over the weekend with no virus

Help - FileNotFoundException during IndexWriter.init()

2007-03-31 Thread Antony Bowesman
I got the following exception this morning when running one last test on a data set that has been indexed many times before over the past few months. java.io.FileNotFoundException: D:\72ed1\server\Java\Search\0008\index\0001\segments_gq9 (Access is denied) at

Re: Help - FileNotFoundException during IndexWriter.init()

2007-03-31 Thread Antony Bowesman
Michael McCandless wrote: Hmmm. It seems like what's happening is the file in fact exists but Lucene gets Access is denied when trying to read it. Lucene takes a listing of the directory, first. So if Lucene has permission to take a directory listing but then no permission to open the

Scores from HitCollector

2007-03-29 Thread Antony Bowesman
Hits will normalise scores to 0..1, but I'm using HitCollector and haven't worked out how to normalise those scores. From what I can see, the scores are just multiplied by a factor to bring the top score down to 1. Is this right or is there something more to it? Do I need to normalise scores

Re: Scores from HitCollector

2007-03-29 Thread Antony Bowesman
anyone say why this is useful and what's wrong about raw scores? Thanks Antony On 3/29/07, Antony Bowesman [EMAIL PROTECTED] wrote: Hits will normalise scores 0=1, but I'm using HitCollector and haven't worked out how to normalise those scores. From what I can see, the scores are just multiplied

Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Antony Bowesman
I've got a similar duplicate case, but my duplicates are based on an external ID rather than Doc id so occurs for a single Query. It's using a custom HitCollector but score based, not field sorted. If my duplicate contains a higher score than one on the PQ I need to update the stored score

Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Antony Bowesman
, so I'm using == to locate it. I've not used equals() as I've not yet worked out whether that will cause me any problems with hashing. Antony Peter On 3/29/07, Antony Bowesman [EMAIL PROTECTED] wrote: I've got a similar duplicate case, but my duplicates are based on an external ID rather

Start/end offsets in analyzers

2007-03-28 Thread Antony Bowesman
I'm fiddling with custom analyzers to analyze email addresses to store the full email address and the component parts. It's based on Solr's analyzer framework, so I have a StandardTokenizerFactory followed by an EmailFilterFactory. It produces Analyzing [EMAIL PROTECTED] 1: [EMAIL

Re: Start/end offsets in analyzers

2007-03-28 Thread Antony Bowesman
Thanks Erik. For our purposes it seems more generally useful to use the original start/end offsets. Antony Erik Hatcher wrote: They aren't used implicitly by anything in Lucene, but can be very handy for efficient highlighting. Where you set the offsets really all depends on how you plan

Re: index word files ( doc )

2007-03-26 Thread Antony Bowesman
Ryan Ackley wrote: The 512 byte thing is a limitation of POIFS I think. I could be wrong though. Have you tried opening the file with just POIFS? It was some time ago, but it looks like I used both org.apache.poi.hwpf.extractor.WordExtractor org.apache.poi.hdf.extractor.WordDocument with the

Re: index word files ( doc )

2007-03-25 Thread Antony Bowesman
I've been using Ryan's textmining in preference to the POI as internally TM uses POI and the Word6 extractor so handles a greater variety of files. Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' files and any ideas on Word files older than the

Re: index word files ( doc )

2007-03-24 Thread Antony Bowesman
www.textmining.org, but the site is no longer accessible. Check Nutch which has a Word parser - it seems to be the original textmining.org Word6+POI parser. Pre-word6 and fast-saved files will not work. I've not found a solution for those Antony [EMAIL PROTECTED] wrote: Thank you, Are

Re: Combining score from two or more hits

2007-03-23 Thread Antony Bowesman
Chris Hostetter wrote: if you are using a HitCollector, there any re-evaluation is going to happen in your code using whatever mechanism you want -- once your collect method is called on a docid, Lucene is done with that docid and no longer cares about it ... it's only whatever storage you may

Re: indexing rss feeds in multiple languages

2007-03-22 Thread Antony Bowesman
Melanie Langlois wrote: Well, thanks, sounds like the best option to me. Does anybody use the PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on the performances when using different analyzers. I've not done any specifc comparisons between using a single Analyzer and

Combining score from two or more hits

2007-03-21 Thread Antony Bowesman
I have indexed objects that contain one or more attachments. Each attachment is indexed as a separate Document along with the object metadata. When I make a search, I may get hits in more than one Document that refer to the same object. I have a HitCollector which knows if the object has

Re: question about getting all terms in a section of the documents

2007-03-20 Thread Antony Bowesman
Donna L Gresh wrote: Also, the terms.close() statement is outside the scope of terms. I changed to the following, is this correct and should the FAQ be changed? try { TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
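The corrected scoping, fleshed out as a hedged sketch around the FAQ's placeholder field name:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    TermEnum terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
    try {
        do {
            Term t = terms.term();
            if (t == null || !t.field().equals("FIELD-NAME-HERE")) break;
            // ... use t.text() ...
        } while (terms.next());
    } finally {
        terms.close(); // now in scope
    }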

IndexWriter.deleteDocuments(Term) vs IndexReader.deleteDocuments(Term)

2007-03-15 Thread Antony Bowesman
The writer method does not return the number of deleted documents. Is there a technical reason why this is not done. I am planning to see about converting my batch deletions using IndexReader to IndexWriter, but I'm currently using the return value to record stats. Does the following give

Re: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Antony Bowesman
Chris Hostetter wrote: the only real reason you should really need 2 searchers at a time is if you are searching other queries in parallel threads at the same time ... or if you are warming up one new searcher that's on deck while still serving queries with an older searcher. Hoss, I hope I

Re: Performance between Filter and HitCollector?

2007-03-14 Thread Antony Bowesman
Thanks for the detailed response Hoss. That's the sort of in-depth golden nugget I'd like to see in a copy of LIA 2 when it becomes available... I've wanted to use Filter to cache certain of my Term Queries, as it looked faster for straight Term Query searches, but Solr's DocSet interface

Re: Wildcard searches with * or ? as the first character

2007-03-13 Thread Antony Bowesman
I have read that with Lucene it is not possible to do wildcard searches with * or ? as the first character. Lucene supports it. If you are using QueryParser to parse your queries see
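The truncated pointer is presumably to QueryParser.setAllowLeadingWildcard; a hedged sketch (field and pattern borrowed from the filename thread above; parse() throws ParseException):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    QueryParser qp = new QueryParser("filename", new StandardAnalyzer());
    qp.setAllowLeadingWildcard(true); // off by default: leading */? enumerates many terms
    Query q = qp.parse("*scm*.doc");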

Re: [Urgent] deleteDocuments fails after merging ...

2007-03-13 Thread Antony Bowesman
Erick Erickson wrote: The javadocs point out that this line int nb = mIndexReaderClone.deleteDocuments(urlTerm) removes *all* documents for a given term. So of course you'll fail to delete any documents the second time you call deleteDocuments with the same term. Isn't the code snippet

Performance between Filter and HitCollector?

2007-03-12 Thread Antony Bowesman
There are (at least) two ways to generate a BitSet which can be used for filtering. Filter.bits() BitSet bits = new BitSet(reader.maxDoc()); TermDocs td = reader.termDocs(new Term(field, text)); while (td.next()) { bits.set(td.doc()); } return bits; and
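The HitCollector variant the snippet cuts off presumably looks like this (a hedged reconstruction, fragment-style like the code above):

    final BitSet bits = new BitSet(reader.maxDoc());
    searcher.search(new TermQuery(new Term(field, text)), new HitCollector() {
        public void collect(int doc, float score) {
            bits.set(doc);
        }
    });
    return bits;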

Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman
Erik Hatcher wrote: Have a look at the CachingWrapperFilter: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/CachingWrapperFilter.html It caches filters by IndexReader instance. Doesn't that still have the same issue in terms of equality of conditions that created

Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman
Chris Hostetter wrote: : I was hoping that Query.equals() would be defined so that equality would be : based on the results that Query generates for a given reader. if query1.equals(query2) then the results of query1 on an indexreader should be identical to the results of query2 on the same

Re: Indexing search?

2007-03-06 Thread Antony Bowesman
Hi, I've indexed 4 among 5 fields with Field.Store.YES, Field.Index.NO. And indexed the remaining one, say its Field Name is *content*, with Field.Store.YES, Field.Index.TOKENIZED (its value is the collective value of the other 4 fields and some more values). So my search is always based on

Re: Caching of BitSets from filters and Query.equals()

2007-03-06 Thread Antony Bowesman
Chris Hostetter wrote: : equals to get q1.equals(q2). The core Lucene Query implementations do override : equals() to satisfy that test, but some of the contrib Query implementations do : not override equals, so you would never see the same Query twice and caching : BitSets for those Query

Caching of BitSets from filters and Query.equals()

2007-03-05 Thread Antony Bowesman
Not sure if I'm going about this the right way, but I want to use Query instances as a key to a HashMap to cache BitSet instances from filtering operations. They are all for the same reader. That means equals() for any instance of the same generic Query would have to return true if the
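A hedged sketch of the intended cache (Java 1.4 style to match the era; computeBits() is a hypothetical stand-in for running the query into a BitSet):

    Map filterCache = new HashMap(); // Query -> BitSet, one cache per IndexReader

    BitSet getBits(Query q) throws IOException {
        BitSet bits = (BitSet) filterCache.get(q);
        if (bits == null) {
            bits = computeBits(q);    // hypothetical: run q into a BitSet
            filterCache.put(q, bits);
        }
        return bits;
    }

This only works if every Query class in play implements equals()/hashCode(), which is exactly the concern raised in the replies above.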

Re: TextMining.org Word extractor

2007-03-04 Thread Antony Bowesman
The Nutch sources contain Ryan Ackley's Word6Extractor which has the header /* Copyright 2004 Ryan Ackley * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * *

Re: Best way to returning hits after search?

2007-03-01 Thread Antony Bowesman
If you decide to cache stored field value in memory, FieldCache may be useful for this - so you don't have to implement your own cache - you can access the field values with something like: FieldCache fieldCache = FieldCache.DEFAULT; String db_id_field[] =
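A hedged completion of the truncated call (field name hypothetical; note FieldCache reads indexed terms, so the field must be indexed untokenized):

    String[] dbIds = FieldCache.DEFAULT.getStrings(reader, "db_id");
    // dbIds[docId] is then an in-memory lookup per hit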

Best way to returning hits after search?

2007-02-27 Thread Antony Bowesman
I am doing what I should not, i.e. iterating the Hits after a search to collect two ID fields from each document in Hits to pass back to the searcher along with the score. The index is approx 10-15 fields per doc, and indexes mail data, which is not stored, as it exists elsewhere. Each mail
