IndexSearch very slow after reopening the index

2010-10-14 Thread subwayne
Hi, I'm facing some problems using Lucene. The index I am using is constructed like this: try { Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English"); Directory dir = MMapDirectory.open(index); IndexWriter writer = new IndexWriter(dir, analyzer, MaxFieldLength.LIMITED);

Re: IndexSearch very slow after reopening the index

2010-10-14 Thread subwayne
Hi Ian, thank you for your quick response. I am running Lucene on Ubuntu 10.04, 64-bit. I switched from MMapDirectory to NIOFSDirectory without any significant change in performance. The Lucene version running is 3.0.2. I followed your advice and opened the IndexSearcher after I added all

Storing additional Metadata with Fields

2010-10-14 Thread Christoph Hermann
Hi, is there a way to store additional metadata with fields? My problem is as follows: I'm extracting extended HTML with Tika. This extended HTML contains references to pages, x,y values of the text etc. I want to be able to retrieve those values when text is found while searching. So when

Re: IndexSearch very slow after reopening the index

2010-10-14 Thread Pradeep Singh
Many times when you run a search for the first time it has to load all field values IF the field is being sorted on. Subsequent searches use that cache and are faster. Does that happen in your case? From your description it doesn't look like you are sorting, although this kind of performance

Re: Storing additional Metadata with Fields

2010-10-14 Thread Pradeep Singh
Payload!! 2010/10/14 Christoph Hermann herm...@informatik.uni-freiburg.de Hi, is there a way to store additional metadata with fields? My problem is as follows: I'm extracting extended HTML with Tika. This extended HTML contains references to pages, x,y values of the text etc. I want to

Re: IndexSearch very slow after reopening the index

2010-10-14 Thread Ian Lea
OK, so it looks like we're down to a more general "why is searching slow" question. The number of docs is not very large by Lucene standards. Work through http://wiki.apache.org/lucene-java/ImproveSearchingSpeed. If that still doesn't help, pick a slow query and post again with: . the output of

Cannot view open issues in Hudson

2010-10-14 Thread David Clarke
Hey guys, whenever I try to view open issues in Hudson, it doesn't display any information. Does anyone know why this is the case or how I could fix it? Thanks in advance -Dave Clarke

Re: Storing additional Metadata with Fields

2010-10-14 Thread Christoph Hermann
On Thursday, 14 October 2010, 12:29:43, you wrote: Hello, is there a way to store additional metadata with fields? Example: I have the following content: <html><body> <span page=1 x=1, y=1>This is a very</span> <span page=1 x=1, y=2>interesting text.</span> <span page=2 x=1, y=1>This is

Re: IndexSearch very slow after reopening the index

2010-10-14 Thread subwayne
OK, I read the wiki page on improving search speed and adopted some of its advice. The slow queries are simple ones. Here are some: plaintext:guid 107.0 ms, resultSet.totalHits = 1; plaintext:allianc 51.0 ms, resultSet.totalHits = 1; plaintext:engin 46.0 ms, resultSet.totalHits = 1

Use of Lucene to store data from RSS feeds

2010-10-14 Thread appy74
Hello I would like to store data retrieved hourly from RSS feeds in a database or in Lucene so that the text can be easily indexed for word frequencies. I need to get the text from the title and description elements of RSS items. Ideally, for each hourly retrieval from a given feed, I would

Re: Use of Lucene to store data from RSS feeds

2010-10-14 Thread Grant Ingersoll
On Oct 14, 2010, at 10:17 AM, app...@dsl.pipex.com wrote: Hello I would like to store data retrieved hourly from RSS feeds in a database or in Lucene so that the text can be easily indexed for word frequencies. I need to get the text from the title and description elements of RSS

ParallelReader

2010-10-14 Thread Nilesh Vijaywargiay
I have two indexes, A and B. Can two documents, doc1 [in index A] and doc2 [in index B], have a common field? doc1 and doc2 have the same document IDs.

RE: determining the type of a term - retrieving a payload

2010-10-14 Thread Sykes, Derek
Hey Grant, Fair point on the next(). In this case I'm iterating through the terms returned from a PrefixTermEnum so I know they're in the index. The analyser I'm using looks like this: public class TypeSavingAnalyzer extends StandardAnalyzer { public TypeSavingAnalyzer(Version version) {

proposed change to CharTokenizer

2010-10-14 Thread Mike Sokolov
Background: I've been trying to enable hit highlighting of XML documents in such a way that the highlighting preserves the well-formedness of the XML. I thought I could get this to work by implementing a CharFilter that extracts text from XML (somewhat like HTMLStripCharFilter, except I am

Re: ParallelReader

2010-10-14 Thread Erick Erickson
No. And you don't even want to try... Document IDs are NOT invariant. In particular, when you delete a document and optimize an index, all the documents that come after the deleted one get new doc IDs. Trying to keep these two indexes in sync will be a nightmare. Perhaps you could explain what
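
The renumbering Erick describes can be illustrated with a toy model. The sketch below is plain Java and does not use Lucene at all; `DocIdShiftDemo` and `ToyIndex` are hypothetical names. It assumes only that a document's ID is its position in a list and that `compact()` (standing in for optimize) drops deleted entries.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Lucene code. A doc ID here is just a list position.
class DocIdShiftDemo {

    static final class ToyIndex {
        private final List<String> docs = new ArrayList<>();
        private final List<Boolean> deleted = new ArrayList<>();

        /** Appends a document and returns the ID it was given. */
        int add(String doc) {
            docs.add(doc);
            deleted.add(false);
            return docs.size() - 1;
        }

        /** Marks a document as deleted; other IDs are untouched for now. */
        void delete(int docId) {
            deleted.set(docId, true);
        }

        /** Stand-in for optimize: drops deleted docs, so later docs move down. */
        void compact() {
            List<String> live = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i)) {
                    live.add(docs.get(i));
                }
            }
            docs.clear();
            docs.addAll(live);
            deleted.clear();
            for (int i = 0; i < docs.size(); i++) {
                deleted.add(false);
            }
        }

        /** Returns the current ID of a live document, or -1 if absent. */
        int idOf(String doc) {
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i) && docs.get(i).equals(doc)) {
                    return i;
                }
            }
            return -1;
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a");
        int idOfB = index.add("b");
        index.add("c");

        System.out.println("c before delete:  " + index.idOf("c"));
        index.delete(idOfB);
        System.out.println("c after delete:   " + index.idOf("c"));
        index.compact();
        System.out.println("c after compact:  " + index.idOf("c"));
    }
}
```

In this model, "c" keeps its ID after the delete and only moves once the index is compacted, which is why a second index that relies on matching IDs can silently drift out of step.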

Re: ParallelReader

2010-10-14 Thread Nilesh Vijaywargiay
Hey Erick, Sure. What I am trying to achieve: A) Update a field in Index A. B) When searching for that old field, it should be a miss. How I achieved it: Index 1: Doc 1 - Field1, Value 1; Doc 2 - Field1, Value 1. Index 2: Doc 1 - Field1, Modified_Value 1; Doc 2 - EMPTY. Add index 2

Re: ParallelReader

2010-10-14 Thread Erick Erickson
This seems like far too much work if I'm reading things right. You can't update a field, but you #can# update a document, which actually re-indexes that document under the covers (you have to have a way to uniquely identify the doc). Then, when you reopen your index reader, you'll only see the new
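
The "update a document" behaviour Erick describes amounts to deleting by a unique key and adding a replacement. The sketch below models that with a toy in-memory index in plain Java; it is not Lucene code, and `UpdateByKeyDemo`, `ToyIndex`, and the field names `id` and `field1` are made up for illustration. In Lucene itself the corresponding call is `IndexWriter.updateDocument(Term, Document)`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Lucene code. Documents are simple field -> value maps.
class UpdateByKeyDemo {

    static final class ToyIndex {
        private final List<Map<String, String>> docs = new ArrayList<>();

        void addDocument(Map<String, String> doc) {
            docs.add(new LinkedHashMap<>(doc));
        }

        /** Deletes every doc whose key field matches, then adds the replacement. */
        void updateDocument(String keyField, String keyValue, Map<String, String> replacement) {
            docs.removeIf(d -> keyValue.equals(d.get(keyField)));
            docs.add(new LinkedHashMap<>(replacement));
        }

        /** Counts documents whose field has exactly the given value. */
        int countMatches(String field, String value) {
            int hits = 0;
            for (Map<String, String> d : docs) {
                if (value.equals(d.get(field))) {
                    hits++;
                }
            }
            return hits;
        }
    }

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            m.put(keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(doc("id", "1", "field1", "value1"));
        index.addDocument(doc("id", "2", "field1", "value1"));

        // The replacement must carry ALL fields of the doc, not just the changed one.
        index.updateDocument("id", "1", doc("id", "1", "field1", "modified"));

        System.out.println("hits for old value: " + index.countMatches("field1", "value1"));
        System.out.println("hits for new value: " + index.countMatches("field1", "modified"));
    }
}
```

Because the old document is removed as a whole, the replacement has to include every field that queries may use, not only the modified one.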

Re: ParallelReader

2010-10-14 Thread Rob Bygrave
Any case where it would break? If a query uses multiple fields it would break. That is, usually all the fields need to be in the doc in index 2 - not just the modified one. On Fri, Oct 15, 2010 at 2:35 PM, Erick Erickson erickerick...@gmail.com wrote: This seems like far too much work if I'm