Re: Count for a keyword occurrence in a file
On Thursday 29 April 2004 08:14, Nader S. Henein wrote:
> Tricky, scoring has to do with the frequency of the occurrence of the word
> as opposed to the amount of words in the file in general (somebody correct
> me if I'm wrong), so short of an educated approximation, you could hack

Lucene uses two frequencies for a term: the number of documents in the index in which it occurs (the basis for IDF), and the number of times the term occurs within a document.

> the indexer to dynamically store the frequency of a word (oh so
> unadvisable). Personally I recommend the educated approximation, because
> you could index the document with the number of words in it (you would
> have to make sure you're not using the stop-word analyzer or the Porter stemmer)
> and then, based on the score, reverse engineer the result you want.
>
> Nader Henein
>
> -----Original Message-----
> From: hemal bhatt [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 28, 2004 5:50 PM
> To: Lucene Users List
> Subject: Count for a keyword occurrence in a file
>
> Hi,
>
> How can I get a count of the score given by Hits.score()? I.e. I want to
> know how many times a keyword occurs in a file. Any help on this would be
> appreciated.

The easiest way is to use an IndexReader. I don't know what you mean by "file" (index or document), but you can get both frequencies mentioned above from an IndexReader, possibly using skipTo() to move to the document in question. The methods are docFreq(Term) and termDocs(Term).

Regards,
Ype
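A minimal sketch of the IndexReader approach described above, against the Lucene 1.3/1.4-era API; the index path and the field name are assumptions:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class TermCounts {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical location
            Term term = new Term("contents", "keyword");              // hypothetical field and term

            // Number of documents in the index containing the term (the IDF basis).
            System.out.println("document frequency: " + reader.docFreq(term));

            // Number of occurrences of the term inside each matching document.
            TermDocs termDocs = reader.termDocs(term);
            while (termDocs.next()) {
                System.out.println("doc " + termDocs.doc() + ": " + termDocs.freq() + " occurrences");
            }
            termDocs.close();
            reader.close();
        }
    }

termDocs.skipTo(n) can jump straight to a known document number instead of iterating over all matching documents.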
Bug in Luke
Hi! Searching does not work correctly with the RussianAnalyzer when it extracts stems: it only finds words whose form coincides with the stem. A wildcard search, for example, gives a different result. Thanks, Vladimir.
RE: Count for a keyword occurrence in a file
Tricky; scoring has to do with the frequency of the occurrence of the word as opposed to the number of words in the file in general (somebody correct me if I'm wrong), so short of an educated approximation, you could hack the indexer to dynamically store the frequency of a word (oh so unadvisable). Personally I recommend the educated approximation: you could index the document with the number of words in it (you would have to make sure you're not using the stop-word analyzer or the Porter stemmer) and then, based on the score, reverse engineer the result you want.

Nader Henein

-----Original Message-----
From: hemal bhatt [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 28, 2004 5:50 PM
To: Lucene Users List
Subject: Count for a keyword occurrence in a file

Hi,

How can I get a count of the score given by Hits.score()? I.e. I want to know how many times a keyword occurs in a file. Any help on this would be appreciated.

regards
Hemal Bhatt
RE: Documents when the same search is done many times
The short answer is: it's up to you :-) Lucene doesn't know which field is your primary key (you're thinking like a DB programmer). If you add the new document with ID="one" without deleting the old one from the index, then when you search you'll get two documents, "pig" and "mongoose"; but if you delete all documents with ID="one" and then index your new document, you'll only get "mongoose".

From a DBA perspective, Lucene is like a table with a unique ID on each document (that being the Lucene-assigned doc ID, which changes every time you optimize but nevertheless remains unique), and all other columns, whether indexed, tokenized, stored or not, can bear repetition. So if you want to implement a unique key like ID on your Lucene index, you'll have to do a little delete based on that ID field every time you insert a new document into the index. It's quite simple and I've been doing it for a few years now without fail.

Hope this helps

Nader Henein
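A sketch of the delete-then-add pattern described above, against the Lucene 1.3-era API; the directory path, field names and analyzer are assumptions:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateById {
        public static void update(String indexDir, String id, String content) throws Exception {
            // 1. Delete any existing documents carrying this application-level ID.
            IndexReader reader = IndexReader.open(indexDir);
            reader.delete(new Term("id", id));
            reader.close();

            // 2. Add the replacement document (create=false so the index is appended to, not wiped).
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("id", id));        // stored, indexed, not tokenized: exact-match key
            doc.add(Field.Text("content", content)); // tokenized and indexed
            writer.addDocument(doc);
            writer.close();
        }
    }

The IndexReader is closed before the IndexWriter is opened because deletions and additions both need the index write lock.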
DEFAULT_OPERATOR_AND
Hi! I have lucene1.4-rc3-dev. TestQueryParser works with RussianAnalyzer(RussianCharsets.CP1251) and Russian terms.

...
public Query getQueryDOA(String query, Analyzer a) throws Exception {
  if (a == null)
    a = new RussianAnalyzer(RussianCharsets.CP1251);
    // a = new SimpleAnalyzer();
  QueryParser qp = new QueryParser("field", a);
  qp.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
  return qp.parse(query);
}
...

In reality QueryParser still behaves as QueryParser.DEFAULT_OPERATOR_OR even after setting QueryParser.DEFAULT_OPERATOR_AND. For example:

1. Query (after setting DEFAULT_OPERATOR_AND): term1 term2 term3 -- Result: term1 OR term2 OR term3
2. Query: +term1 +term2 +term3 -- Result: term1 AND term2 AND term3

Please help me solve this problem. Thanks, Vladimir.
Re: Combining text search + relational search
On Wednesday 28 April 2004 11:00, [EMAIL PROTECTED] wrote:
> Basically I want to limit the results of the text search by the rows that
> are returned in a relational search of other attribute data related to the
> document. The text of the document is just like any other attribute, it
> just needs to be queried differently. Does that make sense?

Yes. But why not just store the textual content in one Lucene field, and the metadata in one or more separate fields? You can then easily build queries to combine searches. And as long as metadata values are normalized, the added index size is probably insignificant compared to the fully indexed text content.

-+ Tatu +-
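A sketch of the single-index approach suggested above, with a free-text field plus normalized metadata fields; the field names, values and the particular metadata constraint are invented for illustration (Lucene 1.3-era API):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ContentPlusMetadata {
        // Index time: full text in one field, normalized metadata in keyword fields.
        public static Document buildDoc(String text, String author, String dept) {
            Document doc = new Document();
            doc.add(Field.Text("contents", text));        // tokenized, indexed
            doc.add(Field.Keyword("author", author));     // untokenized, exact value
            doc.add(Field.Keyword("department", dept));
            return doc;
        }

        // Search time: free-text query ANDed with a metadata constraint.
        public static Hits search(IndexSearcher searcher, String userQuery, String dept)
                throws Exception {
            Query text = QueryParser.parse(userQuery, "contents", new StandardAnalyzer());
            BooleanQuery combined = new BooleanQuery();
            combined.add(text, true, false);                                        // required
            combined.add(new TermQuery(new Term("department", dept)), true, false); // required
            return searcher.search(combined);
        }
    }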
Documents when the same search is done many times
What happens in this situation: I query a field "id" for "one" and, say, I get a Document object (object A) from a search which has a field "content" with the value "pig", and that object persists forever. Then a new index is written with a document with "id"="one" and "content"="mongoose". Another search does the same query of "id" for "one". Will this new search return the same object A or a new object? If they are different, will examining object A show that the "content" field has changed? Thanks
Created LockObtainTimedOut wiki page
I just created a LockObtainTimedOut wiki entry... feel free to add. I just entered the Tomcat issue with java.io.tmpdir as well. http://wiki.apache.org/jakarta-lucene/LockObtainTimedOut Peace! -- Kevin A. Burton
Re: lucene applicability and performance
Greg,

> Yes, see RemoteSearchable and MultiSearcher in org.apache.lucene.search.
> (See the javadoc on the website)

I meant ParallelMultiSearcher. Good night, Ype
Re: lucene applicability and performance
Greg,

On Wednesday 28 April 2004 21:44, Greg Conway wrote:
> Hello. Apologies if this has come up before, I'm new to the list and
> didn't see anything in the archives that exactly matched my situation.

It has, but each situation is different. Try this: http://jakarta.apache.org/lucene/docs/benchmarks.html

> I am considering using Lucene to index and search a large collection of
> small documents in a specialized domain -- probably only a few thousand
> unique terms spanning across anywhere from one million to ten million
> small source documents. I hope to be able to get ranked search results
> back in less than 400 msec.
>
> I suspect one issue I may face is index density owing to the large
> numbers of documents and relatively small vocabulary. That, in turn,
> may be a drag on query processing. I am working on strategies to
> ameliorate that somewhat but it may be difficult.

A text search engine is your best bet in this situation.

> In the meantime, I'm looking for some gut reactions from the experts
> before I take this to the next stage. Can Lucene scale well to this
> kind of situation?

Yes.

> Can I realistically hope to get anywhere near my performance targets?

Yes.

> Will I have to distribute pieces of the index across several machines,
> parallelize my retrievals, and merge the results to do so?

That's more difficult to say. You'll need to try.

> If so, does Lucene already support that or will I have to develop that
> logic in house? (Seems like I saw a reference somewhere that such a
> feature was coming soon, but I'm not sure when or how it will be
> implemented.)

Yes, Lucene already supports it: see RemoteSearchable and MultiSearcher in org.apache.lucene.search (see the javadoc on the website), so no, you won't have to develop that logic in house. But first make sure that the Analyzer you use for indexing fits your needs.

Have fun,
Ype
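A minimal sketch of splitting an index into pieces and searching them together with MultiSearcher, as suggested above; the index locations and field name are assumptions, and a RemoteSearchable obtained over RMI could stand in for either IndexSearcher to move a piece to another machine (the ParallelMultiSearcher mentioned in the follow-up is a drop-in replacement that queries the pieces concurrently):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class SplitIndexSearch {
        public static void main(String[] args) throws Exception {
            Searchable[] pieces = {
                new IndexSearcher("/indexes/part1"),   // hypothetical index locations
                new IndexSearcher("/indexes/part2")
            };
            MultiSearcher searcher = new MultiSearcher(pieces);
            Query q = QueryParser.parse("some query terms", "contents", new StandardAnalyzer());
            Hits hits = searcher.search(q);
            System.out.println(hits.length() + " hits across all pieces");
            searcher.close();
        }
    }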
Re: 'Lock obtain timed out' even though NO locks exist...
Gus Kormeier wrote:
> Not sure if our installation is the same or not, but we are also using
> Tomcat. I had a similar problem last week; it occurred after Tomcat went
> through a hard restart and some software errors had the website hammered.
> I found the lock file in /usr/local/tomcat/temp/ using locate. According
> to the README.txt this is a directory created for the JVM within Tomcat.
> So it is a system temp directory, just inside Tomcat.

Man... you ROCK! I didn't even THINK of that... Hm... I wonder if we should include the name of the lock file in the Exception; within Tomcat that would probably have saved me a lot of time :) Either that or we can put this in the wiki.

Kevin
Bug in Sandbox - Berkeley DB
IndexReader.delete(int docid) doesn't work with the Berkeley DB implementation of org.apache.lucene.store.Directory. This error message appears when closing an IndexReader which has a deletion:

PANIC: Invalid argument

I get this stack trace:

java.io.IOException: DB_RUNRECOVERY: Fatal error, run database recovery
  at org.apache.lucene.store.db.Block.put(Block.java:128)
  at org.apache.lucene.store.db.DbOutputStream.close(DbOutputStream.java:111)
  at org.apache.lucene.util.BitVector.write(BitVector.java:155)
  at org.apache.lucene.index.SegmentReader$1.doBody(SegmentReader.java:162)
  at org.apache.lucene.store.Lock$With.run(Lock.java:148)
  at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:157)
  at org.apache.lucene.index.IndexReader.close(IndexReader.java:422)

Help!

- andy g

Code that triggers this:

// dbdir is a working DbDirectory, docid is a search result
IndexReader read = IndexReader.open(dbdir);
read.delete(docid);
read.close();
RE: 'Lock obtain timed out' even though NO locks exist...
Not sure if our installation is the same or not, but we are also using Tomcat. I had a similar problem last week; it occurred after Tomcat went through a hard restart and some software errors had the website hammered. I found the lock file in /usr/local/tomcat/temp/ using locate. According to the README.txt this is a directory created for the JVM within Tomcat. So it is a system temp directory, just inside Tomcat.

Hope that helps,
-Gus
Re: 'Lock obtain timed out' even though NO locks exist...
James Dunn wrote:
> Which version of lucene are you using? In 1.2, I believe the lock file
> was located in the index directory itself. In 1.3, it's in your system's
> tmp folder.

Yes... 1.3, and I have a script that removes the locks from both dirs... This is only one process so it's just fine to remove them.

> Perhaps it's a permission problem on either one of those folders. Maybe
> your process doesn't have write access to the correct folder and is thus
> unable to create the lock file?

I thought about that too... I have plenty of disk space so that's not an issue. Also did a chmod -R so that should work too.

> You can also pass lucene a system property to increase the lock timeout
> interval, like so:
>
> -Dorg.apache.lucene.commitLockTimeout=60000
>
> or
>
> -Dorg.apache.lucene.writeLockTimeout=60000

I'll give that a try... good idea.

Kevin
Re: 'Lock obtain timed out' even though NO locks exist...
Kevin A. Burton wrote:
> Actually this is exactly the problem... I ran some single index tests and
> a single process seems to read from it. The problem is that we were
> running under Tomcat with diff webapps for testing and didn't run into
> this problem before. We had an 11G index that just took a while to open
> and during this open Lucene was creating a lock. I wasn't sure that Tomcat
> was multithreading this so maybe it is and it's just taking longer to open
> the lock in some situations.

This is strange... after removing all the webapps (besides 1), Tomcat still refuses to allow Lucene to open this index, with "Lock obtain timed out". If I open it up from the console it works just fine. I'm only doing it with one index and a ulimit -n, so it's not a files issue. Memory is 1G for Tomcat. If I figure this out I will be sure to send a message to the list. This is a strange one.

Kevin
lucene applicability and performance
Hello. Apologies if this has come up before; I'm new to the list and didn't see anything in the archives that exactly matched my situation.

I am considering using Lucene to index and search a large collection of small documents in a specialized domain -- probably only a few thousand unique terms spanning across anywhere from one million to ten million small source documents. I hope to be able to get ranked search results back in less than 400 msec.

I suspect one issue I may face is index density owing to the large number of documents and relatively small vocabulary. That, in turn, may be a drag on query processing. I am working on strategies to ameliorate that somewhat but it may be difficult.

In the meantime, I'm looking for some gut reactions from the experts before I take this to the next stage. Can Lucene scale well to this kind of situation? Can I realistically hope to get anywhere near my performance targets? Will I have to distribute pieces of the index across several machines, parallelize my retrievals, and merge the results to do so? If so, does Lucene already support that or will I have to develop that logic in house? (Seems like I saw a reference somewhere that such a feature was coming soon, but I'm not sure when or how it will be implemented.)

Any help, tips, references, or advice would be welcome and appreciated. Thank you!

Regards,
Greg
Re: 'Lock obtain timed out' even though NO locks exist...
[EMAIL PROTECTED] wrote:
> It is possible that a previous operation on the index left the lock open.
> Leaving the IndexWriter or Reader open without closing them (in a finally
> block) could cause this.

Actually this is exactly the problem... I ran some single index tests and a single process seems to read from it. The problem is that we were running under Tomcat with diff webapps for testing and didn't run into this problem before. We had an 11G index that just took a while to open, and during this open Lucene was creating a lock. I wasn't sure that Tomcat was multithreading this, so maybe it is and it's just taking longer to open the lock in some situations.

Kevin
Re: 'Lock obtain timed out' even though NO locks exist...
Which version of lucene are you using? In 1.2, I believe the lock file was located in the index directory itself. In 1.3, it's in your system's tmp folder.

Perhaps it's a permission problem on either one of those folders. Maybe your process doesn't have write access to the correct folder and is thus unable to create the lock file?

You can also pass lucene a system property to increase the lock timeout interval, like so:

-Dorg.apache.lucene.commitLockTimeout=60000

or

-Dorg.apache.lucene.writeLockTimeout=60000

The above sets the timeout to one minute.

Hope this helps,

Jim

--- "Kevin A. Burton" <[EMAIL PROTECTED]> wrote:
> I've noticed this really strange problem on one of our boxes. It's
> happened twice already.
>
> We have indexes where when Lucene starts it says 'Lock obtain timed out'
> ... however NO locks exist for the directory.
>
> There are no other processes present and no locks in the index dir or /tmp.
>
> Is there any way to figure out what's going on here?
>
> Looking at the index it seems just fine... But this is only a brief
> glance. I was hoping that if it was corrupt (which I don't think it is)
> that lucene would give me a better error than "Lock obtain timed out"
>
> Kevin
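A hedged sketch of setting the same timeouts from code instead of the command line; this assumes the properties are read when Lucene's index classes initialize, so they must be set before the first IndexWriter or IndexReader is created:

    // Programmatic equivalent of the -D flags above (values are in milliseconds).
    System.setProperty("org.apache.lucene.writeLockTimeout", "60000");
    System.setProperty("org.apache.lucene.commitLockTimeout", "60000");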
RE: 'Lock obtain timed out' even though NO locks exist...
It is possible that a previous operation on the index left the lock open. Leaving the IndexWriter or Reader open without closing them (in a finally block) could cause this.

Anand

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 28, 2004 2:57 PM
To: Lucene Users List
Subject: 'Lock obtain timed out' even though NO locks exist...

I've noticed this really strange problem on one of our boxes. It's happened twice already. We have indexes where when Lucene starts it says 'Lock obtain timed out'... however NO locks exist for the directory. There are no other processes present and no locks in the index dir or /tmp. Is there any way to figure out what's going on here? Looking at the index it seems just fine... But this is only a brief glance. I was hoping that if it was corrupt (which I don't think it is) that lucene would give me a better error than "Lock obtain timed out".

Kevin
'Lock obtain timed out' even though NO locks exist...
I've noticed this really strange problem on one of our boxes. It's happened twice already.

We have indexes where when Lucene starts it says 'Lock obtain timed out'... however NO locks exist for the directory.

There are no other processes present and no locks in the index dir or /tmp.

Is there any way to figure out what's going on here?

Looking at the index it seems just fine... But this is only a brief glance. I was hoping that if it was corrupt (which I don't think it is) that Lucene would give me a better error than "Lock obtain timed out".

Kevin
RE: ArrayIndexOutOfBoundsException
Philippe, thanks for the reply. I didn't FTP my index anywhere, but your response does make it seem that my index is in fact corrupted somehow.

Does anyone know of a tool that can verify the validity of a Lucene index, and/or possibly repair it? If not, does anyone have any idea how difficult it would be to write one?

Thanks,

Jim

--- Phil brunet <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I had this problem when I transferred a Lucene index by FTP in "ASCII"
> mode. Using binary mode, I never had such a problem.
>
> Philippe
Re: Combining text search + relational search
Create a Lucene index from the data in the DB, and make sure to include the PKs in one of the fields (use Field.Keyword). Then query your RDBMS and get back the ResultSet. Then get the PK from each ResultSet row and use it to construct a Lucene BooleanQuery, which should include your original query string ANDed with the returned PKs combined with OR. That is, if I understand what you are trying to do :)

Otis

--- [EMAIL PROTECTED] wrote:
> Basically I want to limit the results of the text search by the rows that
> are returned in a relational search of other attribute data related to the
> document. The text of the document is just like any other attribute, it
> just needs to be queried differently. Does that make sense?
>
> Thanks
> Mike
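A sketch of the approach described above, restricting the text query to the primary keys returned by a SQL query; the table, column and field names are invented for illustration, and note that BooleanQuery caps the number of clauses (1024 by default), so very large PK sets would need a different approach:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class TextPlusRelational {
        public static Hits search(Connection conn, IndexSearcher searcher, String userQuery)
                throws Exception {
            // 1. Relational part: collect the PKs that satisfy the metadata constraints.
            BooleanQuery pkFilter = new BooleanQuery();
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT doc_id FROM documents WHERE status = 'published'");
            while (rs.next()) {
                // each PK clause is optional on its own, so the group behaves as an OR
                pkFilter.add(new TermQuery(new Term("pk", rs.getString(1))), false, false);
            }
            rs.close();
            st.close();

            // 2. Text part, ANDed with the PK group.
            Query text = QueryParser.parse(userQuery, "contents", new StandardAnalyzer());
            BooleanQuery combined = new BooleanQuery();
            combined.add(text, true, false);      // required
            combined.add(pkFilter, true, false);  // required: at least one of the PKs must match
            return searcher.search(combined);
        }
    }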
Re: Combining text search + relational search
Basically I want to limit the results of the text search by the rows that are returned in a relational search of other attribute data related to the document. The text of the document is just like any other attribute, it just needs to be queried differently. Does that make sense?

Thanks
Mike

On 04/28/2004 10:38 AM, Stephane James Vaucher wrote:
> I'm a bit confused why you want this.
Re: Read past EOF and negative bufferLength problem (1.4 rc2)
Daniel,

Everything works fine with the latest CVS version of Lucene. It looks like the bug I hit was the one that you referenced in your email, which is now fixed. Thanks for your help.

. .. . ...joe

Daniel Naber wrote:
> On Tuesday, 27 April 2004 21:00, Joe Berkovitz wrote:
> > Using Lucene 1.4 rc2 I've run into a fatal problem:
>
> Could you try with the latest version from CVS? Several severe problems
> have been fixed, but I'm not sure if yours was one of them. Also see
> http://issues.apache.org/bugzilla/show_bug.cgi?id=27587
Re: Combining text search + relational search
I'm a bit confused why you want this.

As far as I know, relational db searches will return exact matches without a measure of relevancy. To measure relevancy, you need a search engine. For your results to be coherent, you would have to put everything in the Lucene index.

As for memory consumption: for searching, if the index is on disk, then the memory footprint depends on the type of queries you use. For indexing, it depends on whether you use a tmp RAMDirectory to do merges; otherwise, memory consumption is minimal.

HTH
sv

On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote:
> I need to somehow allow users to do a text search and query relational
> database attributes at the same time. The attributes are basically
> metadata about the documents that the text search will be performed on.
> I have the text of the documents indexed in Lucene. Does anyone have any
> advice or examples? I also need to make sure I don't gobble up all the
> memory on our server.
>
> Thanks
> Mike
RE: Re-associate a token with its source
Thank you, but I think I didn't explain my problem clearly enough. I have four positions (top, bottom, right and left) for each of the words of the document, so I would have to store in the index the content of the page with the positions in the middle:

org.apache.lucene.document.Field#UnIndexed("content", "house 1142 1231 3212 2214 dog 2213 2432 3214 2134 ...")

In order to get the values after a search I would need to parse the returned document to find the positions that are next to the searched word. I have seen that the class Token has four properties (beginColumn, beginLine, endColumn and endLine) and I don't know if it is possible to use them to store, for each token, the position that I want. I think this approach is not the correct one, so any help on this would be appreciated.

Olaia.

-----Original Message-----
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 27 April 2004 21:46
To: Lucene Users List
Subject: Re: Re-associate a token with its source

When indexing, use UnIndexed fields to store this data in your document: org.apache.lucene.document.Field#UnIndexed(String name, String value). Add the fields using org.apache.lucene.document.Document.add(Field). After your search, you can get the field value from Document Hits.doc(int). You can retrieve your stored values using String Document.get(String name).

HTH,
sv

On Tue, 27 Apr 2004, Olaia Vázquez Sánchez wrote:
> Hello
>
> I have documents in XML in which, for each word, I have four positions
> (top, bottom, left and right) that would let me highlight this word in a
> jpg image. I want to index these XML documents and highlight the results
> of the queries in the image, so I need to store these positions for each
> word inside the index.
>
> I was searching for how I can use the Token fields to store these
> attributes, but I didn't find any example where these fields are used.
>
> Thanks,
>
> Olaia Vázquez
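One hedged sketch of a workaround along the lines suggested above: index the page text normally for searching, and keep the per-word coordinates in a stored-only (UnIndexed) field that the application parses after a hit. The field names, coordinate encoding and image name are all invented for illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class PageDocument {
        public static Document build(String pageText, String wordCoords, String imageName) {
            Document doc = new Document();
            doc.add(Field.Text("contents", pageText));      // tokenized and indexed, used for searching
            doc.add(Field.UnIndexed("coords", wordCoords));  // stored only, e.g. "house=1142,1231,3212,2214\ndog=2213,2432,3214,2134"
            doc.add(Field.UnIndexed("image", imageName));    // e.g. "page-001.jpg"
            return doc;
        }
    }

After a search, Hits.doc(i).get("coords") returns the stored string, which the application can parse to find the rectangle of the matched word and highlight it on the image.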
Combining text search + relational search
I need to somehow allow users to do a text search and query relational database attributes at the same time. The attributes are basically metadata about the documents that the text search will be performed on. I have the text of the documents indexed in Lucene. Does anyone have any advice or examples? I also need to make sure I don't gobble up all the memory on our server.

Thanks
Mike
[Lucene] XML Indexing
XMLIndexingDemo seems not able to index traditional Chinese characters: I can only search for English text and not Chinese. In fact, my XML document contains both Chinese and English text. How can I fix this problem? Is it necessary for me to convert the Chinese characters from BIG5 to UTF-8 before doing the file indexing? If it is, then how can we do it? This problem doesn't happen when indexing bilingual HTML files (Chinese & English) with the Lucene demo HTML parser.
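A hedged sketch of one likely fix: decode the XML file with its real charset (Big5 here, as an assumption about the source files) when building the Document, so the analyzer sees proper Unicode. Java strings are Unicode internally, so no separate conversion to UTF-8 is needed for indexing:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class Big5Indexing {
        public static Document buildDoc(File f) throws IOException {
            // Decode the bytes using the file's actual encoding instead of the platform default.
            Reader content = new InputStreamReader(new FileInputStream(f), "Big5");
            Document doc = new Document();
            doc.add(Field.Text("contents", content));    // tokenized and indexed, not stored
            doc.add(Field.Keyword("path", f.getPath()));
            return doc;
        }
    }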
Count for a keyword occurrence in a file
Hi,

How can I get a count of the score given by Hits.score()? I.e. I want to know how many times a keyword occurs in a file. Any help on this would be appreciated.

regards
Hemal Bhatt
Re: Segments file get deleted?!
Hi,

Thanks for the reply. I got that error in my previous build; now I don't see it at all. Also, I couldn't retain the log. I will definitely come back if I see it again. Anyway, below is my machine config: Windows XP Personal Ed., 512MB, P4. My app server is Resin 2.1.12. I will definitely come up with more details when I get it again. Thanks again.

Surya

----- Original Message -----
From: "Nader S. Henein" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, April 26, 2004 12:42 PM
Subject: RE: Segments file get deleted?!

Can you give us a bit of background? We've been using Lucene since the first stable release 2 years ago, and I've never had segments disappear on me. First of all, can you provide some background on your setup, and secondly, when you say "a certain period of time", how much time are we talking about here, and does that interval coincide with your indexing schedule? You may have the create flag on the Indexer set to true, so it simply recreates the index at every update and deletes whatever was there; of course, if there are no files to index at any point it will just give you a blank index.

Nader Henein

-----Original Message-----
From: Surya Kiran [mailto:[EMAIL PROTECTED]
Sent: Monday, April 26, 2004 7:48 AM
To: [EMAIL PROTECTED]
Subject: Segments file get deleted?!

Hi all, we have implemented our portal search using Lucene. It works fine, but after a certain period of time the Lucene "segments" file gets deleted. Eventually all searches fail. Can anyone guess where the error could be? Thanks a lot.

Regards
Surya.
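A small sketch of the create-flag point raised above: only create a fresh index when none exists yet, otherwise append, so a scheduled update cannot silently wipe the existing segments. The index path and analyzer are assumptions:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class SafeWriterOpen {
        public static IndexWriter open(String indexDir) throws Exception {
            // create=true wipes any existing index, so only use it when nothing is there yet.
            boolean create = !IndexReader.indexExists(new File(indexDir));
            return new IndexWriter(indexDir, new StandardAnalyzer(), create);
        }
    }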
RE: ArrayIndexOutOfBoundsException
Hi.

I had this problem when I transferred a Lucene index by FTP in "ASCII" mode. Using binary mode, I never had such a problem.

Philippe

> From: James Dunn <[EMAIL PROTECTED]>
> Subject: ArrayIndexOutOfBoundsException
> Date: Mon, 26 Apr 2004 12:15:39 -0700 (PDT)
>
> Hello all,
>
> I have a web site whose search is driven by Lucene 1.3. I've been doing
> some load testing using JMeter and occasionally I will see the exception
> below when the search page is under heavy load.
>
> Has anyone seen similar errors during load testing?
>
> I've seen some posts with similar exceptions and the general consensus is
> that this error means that the index is corrupt. I'm not sure my index is
> corrupt, however. I can run all the queries I use for load testing under
> normal load and I don't appear to get this error.
>
> Is there any way to verify that a Lucene index is corrupt or not?
>
> Thanks,
>
> Jim
>
> java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
>   at java.util.Vector.elementAt(Vector.java:431)
>   at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
>   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
>   at org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
>   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
>   at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
>   at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
>   at org.apache.lucene.search.Hits.doc(Hits.java:130)
Re: status of LARM project
Kelvin is all correct. A few years ago there were no quality open source crawlers available; there are now a number of very good ones. Archive.org's crawler is available, and there is Larbin, Nutch, etc. LARM works, it's just not maintained any more.

Otis

--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
> As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal,
> that Clemens got a job which wasn't supportive of his continued
> development on LARM. AFAIK there aren't any other active developers of
> LARM (at least at the time it branched off to SF).
>
> Otis recently posted to use Nutch instead of LARM.
>
> Kelvin
>
> On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said:
> > Hi
> >
> > I have looked at the LARM website and I get different results.
> >
> > http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages
> > It says that development has stopped for this project.
> >
> > LARM hosted on SourceForge: the last message in the mailing list was
> > dated 2003. Is it still supported and active?
> >
> > LARM hosted on Apache: it says the project has moved to SourceForge.
> >
> > Can anyone here who is active in LARM comment on the status?
> >
> > Regards
> >
> > Sebastian Ho