RE: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

2009-09-01 Thread Uwe Schindler
One possible problem is that you may not want to restore the peeked token into the TokenFilter's attributes itself. It looks like you want to have a Token instance returned from peek, but the current stream should not reset to this Token (you only want to "look" into the next Token and then possibly

Deletion of words in articles of Wikipedia

2009-09-01 Thread Sahi
Hi, I'm new to this site. My question is: articles in Wikipedia can be edited by everyone and may or may not be accurate. If one contributor writes an article and another contributor then deletes certain content from it, that would indicate that the article is controversial. I need to start o

RE: exception to open a large index Insufficient system resources exist

2009-09-01 Thread Fang_Li
32 bit JVM, 1.3G allocated heap size, Lucene 2.4.1. In my opinion, this exception should not be caused by running out of memory or out of system file handles, because a different exception would be thrown in each of those cases. Any hint? Thanks, Fang, Li -Original Message- From: Uwe Schindler [mail

Re: Purpose of the file modification date methods in Directory?

2009-09-01 Thread cemerick
Fair enough. This Directory impl is really only useful in conjunction with other usage of the jdbm embedded database -- I can't imagine people layering other Lucene-dependent projects on top (like Solr or whatever). I suppose if that time ever comes, I'll revisit the issue. :-) - Chas hossman

Re: Performance diffs between filter.bits() and searcher.docFreq()

2009-09-01 Thread Chris Hostetter
: new TermsFilter( termsList:[ new Term( 'id', '111' ) ] ).bits( indexReader : ).cardinality() ... : indexReader.docFreq( new Term( 'id', '111' ) ) ... : Which one is faster? Can I replace the 2nd one with the 1st and still get : the same performance? "the second", and "no" -Ho
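
A rough sketch of the two counting approaches being compared, assuming the Lucene 2.x core API (the field "id" and value "111" come from the question; the helper class is hypothetical). docFreq() reads the count straight from the term dictionary, while building a bit set means walking the postings for the term, which is why the second is faster; note also that docFreq() does not subtract deleted documents, so the two counts need not even match.

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class DocFreqVsBits {
        // Cheap: the document frequency is stored in the term dictionary,
        // so this is essentially a single term lookup (deleted docs included).
        public static int viaDocFreq(IndexReader reader) throws Exception {
            return reader.docFreq(new Term("id", "111"));
        }

        // Much more work: building a BitSet (roughly what a filter does)
        // requires iterating every posting for the term.
        public static int viaBitSet(IndexReader reader) throws Exception {
            BitSet bits = new BitSet(reader.maxDoc());
            TermDocs td = reader.termDocs(new Term("id", "111"));
            try {
                while (td.next()) {
                    bits.set(td.doc());
                }
            } finally {
                td.close();
            }
            return bits.cardinality();
        }
    }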

Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

2009-09-01 Thread Michael Busch
This is what I had in mind (completely untested!): public class LookaheadTokenFilter extends TokenFilter { /** List of tokens that were peeked but not returned with next. */ LinkedList peekedTokens = new LinkedList(); /** The position of the next character that peek() will return in pee

Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

2009-09-01 Thread Michael Busch
Daniel, take a look at the captureState() and restoreState() APIs in AttributeSource and TokenStream. captureState() returns a State object containing all attributes with their current values. restoreState(State) takes a given State and copies its values back into the TokenStream. You should b
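
A minimal, untested sketch of how captureState()/restoreState() could back a lookahead filter on the 2.9 attribute API (this is not the code attached in the thread; the class and method names below are illustrative):

    import java.io.IOException;
    import java.util.LinkedList;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.util.AttributeSource;

    public class LookaheadTokenFilter extends TokenFilter {
        /** States of tokens that were peeked but not yet returned by incrementToken(). */
        private final LinkedList<AttributeSource.State> peeked =
                new LinkedList<AttributeSource.State>();

        public LookaheadTokenFilter(TokenStream input) {
            super(input);
        }

        /** Replays any peeked tokens first, then pulls from the input as usual. */
        public boolean incrementToken() throws IOException {
            if (!peeked.isEmpty()) {
                restoreState(peeked.removeFirst());
                return true;
            }
            return input.incrementToken();
        }

        /**
         * Pulls the next token from the input so its attributes can be inspected,
         * and captures its state so incrementToken() will return it again later.
         * Returns false when the input is exhausted.
         */
        public boolean peek() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            peeked.addLast(captureState());
            return true;
        }
    }

Because the filter shares its AttributeSource with the input, the peeked token's values are visible in the attributes right after peek(); if that is undesirable, the caller can captureState() before peeking and restoreState() afterwards.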

Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

2009-09-01 Thread Daniel Shane
After thinking about it, the only conclusion I reached was that, instead of saving the token, I should save an iterator of Attributes and use that instead. It may work. Daniel Shane Daniel Shane wrote: Hi all! I'm trying to port my Lucene code to the new TokenStream API and I have a filter that I cannot se

Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

2009-09-01 Thread Daniel Shane
Hi all! I'm trying to port my Lucene code to the new TokenStream API and I have a filter that I cannot seem to port using the current new API. The filter is called LookaheadTokenFilter. It behaves exactly like a normal token filter, except that you can call peek() and get information on the next

Re: New "Stream closed" exception with Java 6

2009-09-01 Thread Daniel Shane
I think you should do this instead (it will print the exception message *and* the stack trace instead of only the message): throw new IndexerException("CorruptIndexException on doc: " + doc.toString(), ex); Daniel Shane Chris Bamford wrote: Hi Grant, I think your code there needs to sh
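
For illustration, a self-contained version of the suggestion (IndexerException is the poster's own class; the shape below, with a (String, Throwable) constructor, is an assumption). Passing the original exception as the cause keeps its stack trace attached to the wrapper:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;

    public class IndexerExample {
        /** Hypothetical application exception that chains its cause. */
        public static class IndexerException extends RuntimeException {
            public IndexerException(String message, Throwable cause) {
                super(message, cause);
            }
        }

        static void add(IndexWriter indexWriter, Document doc) {
            try {
                indexWriter.addDocument(doc);
            } catch (CorruptIndexException ex) {
                // The cause's stack trace is printed along with this message.
                throw new IndexerException("CorruptIndexException on doc: " + doc, ex);
            } catch (IOException ex) {
                throw new IndexerException("IOException on doc: " + doc, ex);
            }
        }
    }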

Re: Purpose of the file modification date methods in Directory?

2009-09-01 Thread Chris Hostetter
: Right, and those methods (IndexReader.lastModified and : IndexCommit.lastModified) aren't used at all. I guess what I meant to say is that they aren't used internally in Lucene, but they are part of the public API, so client code (apps built using Lucene) may expect them to work for their own internal

Field.Store.NO & Field.Index.NOT_ANALYZED & hashCode

2009-09-01 Thread Christian
Hi, I am putting some text into a field which we set to Field.Store.NO & Field.Index.NOT_ANALYZED. We are then doing exact & fuzzy matches against that text (about the length of an average sentence). Currently, we have a single field that is being used for both exact and fuzzy matches while we hav
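
A small sketch of the field setup being described, using the Lucene 2.x Field constructor (the field name below is made up):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ExactFieldExample {
        public static Document build(String sentence) {
            Document doc = new Document();
            // Not stored, not analyzed: the whole sentence is indexed as a
            // single token, so term-level queries must match the exact
            // (or, via fuzzy, near-exact) string.
            doc.add(new Field("sentenceExact", sentence,
                    Field.Store.NO, Field.Index.NOT_ANALYZED));
            return doc;
        }
    }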

Re: Question about IndexCommit

2009-09-01 Thread Chris Hostetter
: Subject: Question about IndexCommit : In-Reply-To: <9ac0c6aa0909010403k3306307dxa7751ecff3fa2...@mail.gmail.com> http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, inst

Re: Why perform optimization in 'off hours'?

2009-09-01 Thread Chris Hostetter
: Subject: Why perform optimization in 'off hours'? : In-Reply-To: : <5b20def02611534db08854076ce825d8032db...@sc1exc2.corp.emainc.com> http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to

RE: Query and language conversion

2009-09-01 Thread Steven A Rowe
Alex, That's right, you'll have to roll your own if you want to do cross-language search in Lucene. But some of the components you need are available. Two possible cross-language search strategies don't scale well when the document collection size is non-trivial: a) translating all documents i

Re: Query and language conversion

2009-09-01 Thread Alex
Many thanks Steve for all that information. I understand from your answer that cross-lingual search doesn't come "out-of-the-box" in Lucene. Cheers. Alex On Tue, Sep 1, 2009 at 6:46 PM, Steven A Rowe wrote: > Hi Alex, > > What you want to do is commonly referred to as "Cross Language Informati

RE: Query and language conversion

2009-09-01 Thread Steven A Rowe
Hi Alex, What you want to do is commonly referred to as "Cross Language Information Retrieval". Doug Oard at the University of Maryland has a page of CLIR resources here: http://terpconnect.umd.edu/~dlrg/clir/ Grant Ingersoll responded to a similar question a couple of years ago on this lis

Re: Lucene gobbling file descriptors

2009-09-01 Thread Erick Erickson
Late reply, but do what Michael said. I didn't understand that you had so many indexes, but opening/closing makes sense to me now; reusing them wouldn't be very useful. Although I *can* imagine a "keep each open for 5 minutes" sort of rule being useful on the assumption that a user might search sever

Query and language conversion

2009-09-01 Thread Alex
Hi, I am new to Lucene so excuse me if this is a trivial question... I have data that I index in a given language (English). My users will come from different countries and my search screen will be internationalized. My users will then probably query things in their own language. Is it possible t

Re: Question about IndexCommit

2009-09-01 Thread Ted Stockwell
That's excellent. Thanks very much for the explanations - Original Message > From: Michael McCandless > To: java-user@lucene.apache.org > Sent: Tuesday, September 1, 2009 8:26:45 AM > Subject: Re: Question about IndexCommit > > Further, when IndexWriter writes new .del files, it's

RE: Lucene gobbling file descriptors

2009-09-01 Thread Chris Bamford
Thanks Mike, I get what you mean now :-) BTW I have tested the code with 1 open/close per search (rather than keeping the IndexReader open between searches) and so far I have witnessed no change in performance! I will experiment with bigger indexes, but early signs are very encouraging :-)

Re: Question about IndexCommit

2009-09-01 Thread Michael McCandless
Further, when IndexWriter writes new .del files, it's always to a new (next generation) filename, so that the old .del file remains present. This means if a fresh IndexReader is opened, it will load the old .del file, and still not see any of IndexWriter's pending changes. Mike On Tue, Sep 1, 20
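
A hedged sketch of the behaviour described in this thread, assuming the 2.9 API: a reader opened on the directory is a point-in-time snapshot, pending deletes are invisible to it, and even after the writer commits a new .del generation the already-open reader keeps its loaded copy until it is reopened.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class ReaderIsolationDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_29),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("id", "1", Field.Store.NO, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.commit();

            IndexReader before = IndexReader.open(dir, true); // point-in-time snapshot
            writer.deleteDocuments(new Term("id", "1"));      // pending, not committed
            System.out.println(before.numDocs());             // 1: pending delete invisible

            writer.commit();                                  // new .del generation written
            System.out.println(before.numDocs());             // still 1: old reader keeps its .del

            IndexReader after = before.reopen();              // picks up the new commit
            System.out.println(after.numDocs());              // 0

            after.close();
            before.close();
            writer.close();
            dir.close();
        }
    }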

Re: Question about IndexCommit

2009-09-01 Thread Shai Erera
If I'm not mistaken, IndexReader reads the .del file into memory, and therefore subsequent updates to it won't be visible to it. Shai On Tue, Sep 1, 2009 at 3:54 PM, Ted Stockwell wrote: > Hi All, > > I am interested in using Lucene to index RDF (Resource Description Format) > data. > Ultimatel

Question about IndexCommit

2009-09-01 Thread Ted Stockwell
Hi All, I am interested in using Lucene to index RDF (Resource Description Framework) data. Ultimately I want to create a transactional interface to the data with proper transaction isolation. Therefore I am trying to educate myself on the details of index readers and writers; I am using v2.9rc2.

Re: Lucene gobbling file descriptors

2009-09-01 Thread Michael McCandless
setUseCompoundFile is an IndexWriter method. It already defaults to "true", so you probably are already using compound file format. If you look in your index directory and see only *.cfs (plus segments_N and segments.gen) then you are using compound file format. Mike On Tue, Sep 1, 2009 at 8:20

RE: Lucene gobbling file descriptors

2009-09-01 Thread Chris Bamford
Hi Mike, Thanks for the suggestions, very useful. I would like to adopt a combination of setUseCompoundFile on the IndexReader and perform an open/close per search. As a start, I just tried to set compound file format on the IndexSearcher's underlying IndexReader, but it is not available as a m

Re: Lucene gobbling file descriptors

2009-09-01 Thread Michael McCandless
In this approach it's expected that you'll run out of file descriptors when "enough" users attempt to search at the same time. You can reduce the number of file descriptors required per IndexReader by 1) using compound file format (it's the default for IndexWriter), and 2) optimizing the index before
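
A short sketch of the two suggestions, assuming the 2.x API (and, as noted elsewhere in this thread, setUseCompoundFile is an IndexWriter method, not a reader or searcher one):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class CompactIndex {
        // Rewrites the index so each reader needs fewer file descriptors:
        // compound file format packs each segment into a single .cfs file,
        // and optimize() merges everything down to one segment.
        public static void compact(File indexDir) throws Exception {
            Directory dir = FSDirectory.open(indexDir);
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_29),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setUseCompoundFile(true); // the default, shown for clarity
            writer.optimize();               // heavy: rewrites the whole index
            writer.close();
            dir.close();
        }
    }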

RE: Lucene gobbling file descriptors

2009-09-01 Thread Chris Bamford
Hi Erick, >>Note that for search speed reasons, you really, really want to share your >>readers and NOT open/close for every request. I have often wondered about this - I hope you can help me understand it better in the context of our app, which is an email client: When one of our users receives

RE: exception to open a large index Insufficient system resources exist

2009-09-01 Thread Uwe Schindler
Which Lucene version, 64 bit JVM? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: fang...@emc.com [mailto:fang...@emc.com] > Sent: Tuesday, September 01, 2009 12:04 PM > To: java-user@lucene.apache.org >

RE: exception to open a large index Insufficient system resources exist

2009-09-01 Thread Fang_Li
We are running on Windows 2003 Enterprise Edition with an NTFS file system on a local disc. The JDK version is 1.5.0.12. The problem was discussed before and no clear solution was confirmed. Thanks. -Original Message- From: Danil ŢORIN [mailto:torin...@gmail.com] Sent: Tuesday, September

Re: exception to open a large index Insufficient system resources exist

2009-09-01 Thread Danil ŢORIN
There should be no problem with large segments. Please describe the OS, file system and JDK you are running on. There might be some problems with files >2GB on Win32/FAT, or on some ancient Linuxes. On Tue, Sep 1, 2009 at 12:37, wrote: > I met a problem to open an index bigger than 8GB and the followi

exception to open a large index Insufficient system resources exist

2009-09-01 Thread Fang_Li
I ran into a problem opening an index bigger than 8GB and the following exception was thrown. There is already a segment which is bigger than 4GB. After searching the internet, it seems that not using the compound index format may solve the problem. The same exception was thrown when merging with another index happ

RE: New "Stream closed" exception with Java 6

2009-09-01 Thread Chris Bamford
Hi Grant, >>I think your code there needs to show the underlying exception, too, so >>we can see that stack trace. Ummm... isn't this code already doing that? What am I missing? try { indexWriter.addDocument(doc); } catch (CorruptIndexException ex) {

Re: Score for query-generated value

2009-09-01 Thread Michael McCandless
Function queries should work here? (org.apache.lucene.search.function.*). Mike On Tue, Sep 1, 2009 at 2:24 AM, marquinhocb wrote: > > I would like to create a scorer that applies a score based on a value that is > calculated during a query.  More specifically, to apply a score based on > geograph
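
One hedged possibility with the org.apache.lucene.search.function package (the field name and the multiplicative combination below are assumptions, not anything from the original post): wrap the text query in a CustomScoreQuery that folds in a per-document value read from a field.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.function.CustomScoreQuery;
    import org.apache.lucene.search.function.FieldScoreQuery;

    public class FunctionScoringExample {
        // Combines each hit's text-query score with a numeric per-document
        // value (e.g. a precomputed geo weight stored in a hypothetical
        // "geoBoost" field, indexed as a plain un-analyzed float).
        public static Query build() {
            Query textQuery = new TermQuery(new Term("body", "restaurant"));
            FieldScoreQuery boost =
                    new FieldScoreQuery("geoBoost", FieldScoreQuery.Type.FLOAT);
            return new CustomScoreQuery(textQuery, boost) {
                // Combine however the application needs; here: multiply.
                public float customScore(int doc, float subQueryScore, float valSrcScore) {
                    return subQueryScore * valSrcScore;
                }
            };
        }
    }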