Ignoring XML tags when Indexing

2008-07-23 Thread Kalani Ruwanpathirana
Hi all, I am searching for a way to ignore XML tags in the input when indexing. Is there built-in functionality in Lucene to get this done? I am sorry if this was discussed before. I searched but couldn't find a clear solution. Thanks in advance Kalani -- Kalani Ruwanpathirana Department of C
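Core Lucene does not parse or strip XML itself, so a common approach is to extract the character data before building the Document. A minimal sketch using the JDK's SAX parser (the file name, field name and index options are only illustrative):

    // uses javax.xml.parsers.*, org.xml.sax.helpers.DefaultHandler,
    // org.apache.lucene.document.Document and org.apache.lucene.document.Field
    final StringBuilder text = new StringBuilder();
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new File("input.xml"), new DefaultHandler() {
        // keep only character data; the tags themselves are dropped
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length).append(' ');
        }
    });

    Document doc = new Document();
    doc.add(new Field("content", text.toString(), Field.Store.NO, Field.Index.TOKENIZED));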

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Yonik Seeley
On Wed, Jul 23, 2008 at 7:47 PM, Jamie <[EMAIL PROTECTED]> wrote: > Could this error be the result of the bad file descriptor close bug as > described in > http://256.com/gray/docs/misc/java_bad_file_descriptor_close_bug.shtml. Hmmm, that's an interesting read. Seems like maybe we should kill most

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Michael McCandless
I think that is the best strategy at this point. The head of 2.3 has a workaround (that so far *seems* to work around) for that JRE bug. Mike Jamie wrote: Hi I feel like we are having to tip toe across JRE bugs to get this to work right. I am definitely not pointing fingers, since the

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Jamie
Hi I feel like we are having to tip toe across JRE bugs to get this to work right. I am definitely not pointing fingers, since the issues and their resolutions are complex but I would appreciate some insight on the most reliable combination of JRE 6 and Lucene. I cannot downgrade the JRE to 5

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Jamie
Hi All, I found something interesting. Could this error be the result of the bad file descriptor close bug as described in http://256.com/gray/docs/misc/java_bad_file_descriptor_close_bug.shtml? This would definitely fit the description since this happened on JRE 1.6u3 apparently, up

Re: How to avoid duplicate records in lucene

2008-07-23 Thread Chris Lu
Sebastin, Lucene is just like a plain database table. It doesn't have a uniqueness constraint, so you can have two documents with the exact same content. What you should do is check for duplication before adding. And if duplication is found, delete the old Document and add a new Document. This way
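A minimal sketch of that delete-then-add pattern, keyed on a unique id field (the path, field names and values are made up; this assumes the 2.3-era API, where IndexWriter.updateDocument(Term, Document) does the same thing in one call):

    // uses org.apache.lucene.index.*, org.apache.lucene.document.*,
    // org.apache.lucene.analysis.standard.StandardAnalyzer
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);

    Document doc = new Document();
    doc.add(new Field("id", "record-42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("body", "new contents...", Field.Store.YES, Field.Index.TOKENIZED));

    // delete any earlier document with the same id, then add the new one
    writer.deleteDocuments(new Term("id", "record-42"));
    writer.addDocument(doc);
    // or, equivalently: writer.updateDocument(new Term("id", "record-42"), doc);

    writer.close();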

Re: deleting documents with doc id

2008-07-23 Thread Karl Wettin
On 23 Jul 2008, at 22:08, Cam Bazz wrote: hello - if I make a query and get the document ids and delete with the document id - could there be a side effect? my index is committed periodically, but I cannot say when it is committed. The only thing is that the deletions will not be visible u
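In code that looks roughly like this (docId being one of the ids returned by the query, and the path a placeholder). One extra caution worth keeping in mind: internal document numbers can shift when segments merge, so the ids should be used in the same session in which they were collected.

    IndexReader reader = IndexReader.open("/path/to/index");
    reader.deleteDocument(docId);   // marks the document as deleted
    reader.close();                 // writes the deletion to the index
    // searchers opened earlier keep returning the document until they are reopened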

Re: Search returns empty Documents

2008-07-23 Thread Johannes Dorn
Hello, I found my mistake: I forgot to add the file's path to the document. Greetings, Johannes Dorn On 23.07.2008, at 23:14, Johannes Dorn wrote: Hello, I am quite new to Lucene. I've added a search to my application by combining the getting started application and the xml#1 contribution.
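For reference, the demo-style way of attaching the path so it comes back with each hit looks roughly like this (f is the java.io.File being indexed; field names follow the getting-started demo):

    Document doc = new Document();
    // stored and not analyzed, so the exact path can be read back from a search hit
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    // indexed from a Reader: searchable, but not stored
    doc.add(new Field("contents", new FileReader(f)));
    writer.addDocument(doc);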

Search returns empty Documents

2008-07-23 Thread Johannes Dorn
Hello, I am quite new to Lucene. I've added a search to my application by combining the getting started application and the xml#1 contribution. Here are the methods used to generate the index. public static void generateIndex() { try { if (!docD

RE: Using lucene to search a bunch of keywords?

2008-07-23 Thread Steven A Rowe
On 07/23/2008 at 5:09 PM, Steven A Rowe wrote: > Karl Wettin's recently committed ShingleMatrixAnalyzer Oops, "ShingleMatrixAnalyzer" -> "ShingleMatrixFilter". Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional co

RE: Using lucene to search a bunch of keywords?

2008-07-23 Thread Steven A Rowe
Hi Ryan, Well, at 100 million+ keywords, Lucene might be the right tool. One thing that you might check out for the query side is Karl Wettin's recently committed ShingleMatrixAnalyzer (not in any Lucene release yet - only on the trunk). The JUnit test class TestShingleMatrixFilter has an exam

deleting documents with doc id

2008-07-23 Thread Cam Bazz
hello - if I make a query and get the document ids and delete with the document id - could there be a side effect? my index is committed periodically, but I cannot say when it is committed. best regards, -c.b.

Re: Using lucene to search a bunch of keywords?

2008-07-23 Thread Ryan D
Heh, actually I'm using Perl but I've always associated text-search with Lucene, I'm not sure if it's the best solution or not. On the small side there are 1.6 million keywords, on the large side there are well over 100 million but I might find another way to break down the searches into sm

RE: Using lucene to search a bunch of keywords?

2008-07-23 Thread Steven A Rowe
Hi Ryan, I'm not sure Lucene's the right tool for this job. I have used regular expressions and ternary search trees in the past to do similar things. Is the set of keywords too large for an in-memory solution like these? If not, consider using a tool like the Perl package Regex::PreSuf

RE: Using lucene to search a bunch of keywords?

2008-07-23 Thread Robert Stewart
You need to invert the process. Using Lucene may not be the best option... You need to make your document a key into an index of key words. I've done the same thing, but not with Lucene. You need to pass through the document and for each word (token) look it up in some index (hashtable) to find po
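A minimal in-memory version of that inversion, with no Lucene involved: load the keywords into a set and slide over the document's tokens, checking one- and two-word candidates (the sample keywords are the ones from the question; phrase is the text being scanned). At the 100-million-keyword scale mentioned in this thread the set may no longer fit in RAM, which is when an on-disk index of the keywords becomes attractive.

    // uses java.util.*
    Set<String> keywords = new HashSet<String>(
            Arrays.asList("big boy", "red ball", "computer"));

    List<String> found = new ArrayList<String>();
    String[] tokens = phrase.toLowerCase().split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
        // single-word keyword?
        if (keywords.contains(tokens[i])) found.add(tokens[i]);
        // two-word keyword such as "big boy" or "red ball"?
        if (i + 1 < tokens.length) {
            String pair = tokens[i] + " " + tokens[i + 1];
            if (keywords.contains(pair)) found.add(pair);
        }
    }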

Using lucene to search a bunch of keywords?

2008-07-23 Thread Ryan Detzel
Everything I've read and seen about Lucene is searching for keywords in documents; I want to do the reverse. I have a huge list of keywords ("big boy", "red ball", "computer") and I have phrases that I want to check to see if the keywords are in them. For example, using the small keyword list above (store in documents

Re: lucene delete by query

2008-07-23 Thread Cam Bazz
How reliable is the version in the trunk? Is it OK for production? On Wed, Jul 23, 2008 at 5:25 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > It's in the lucene trunk (current development version). > IndexWriter.deleteDocuments(Query query) > > -Yonik > > On Wed, Jul 23, 2008 at 9:53 AM, Cam Ba

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Michael McCandless
This log looks healthy. Mike Jamie wrote: Hi The index log file is attached. Many thanks in advance for your consideration! Jamie Jamie wrote: Wasn't there some index corruption issue with Java 1.6 and Lucene 2.3.2? Could this be the problem? Jamie Jamie wrote: Hi Matthew Thanks

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Michael McCandless
Mindaugas Žakšauskas wrote: On Wed, Jul 23, 2008 at 5:46 PM, Matthew Hall <[EMAIL PROTECTED] > wrote: <..>As for the Java 1.6 / Lucene 2.3.2 index corruption issue <..> Correct me if I'm wrong, but doesn't the particular Sun bug [1] only manifest itself if the -Xbatch option is used? Also, the ex

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Mindaugas Žakšauskas
On Wed, Jul 23, 2008 at 5:46 PM, Matthew Hall <[EMAIL PROTECTED]> wrote: > <..>As for the Java 1.6 / Lucene 2.3.2 index corruption issue <..> Correct me if I'm wrong, but doesn't the particular Sun bug [1] only manifest itself if the -Xbatch option is used? Also, the exceptions mentioned in LUCENE-1282

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Michael McCandless
org.apache.lucene.index.CheckIndex checks the index for corruption, and (if you specify -fix) will "repair" the index by removing any segments that have problems. There is this issue with Java 1.6.0_0{4,5}: https://issues.apache.org/jira/browse/LUCENE-1282 Sun is making progress on fi
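For reference, CheckIndex is run from the command line along these lines (jar name and index path are placeholders; back the index up first, since -fix permanently drops any segment it cannot read):

    # inspect only
    java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex /path/to/index

    # drop unreadable segments ("repair")
    java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex /path/to/index -fix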

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Matthew Hall
I'm not sure which file in particular would be the corrupted/missing one, which is why I suggested looking at the index with Luke. As for the Java 1.6 / Lucene 2.3.2 index corruption issue, I'm not 100% familiar with the details on that one, but as a quick test, you should be able to swap to a 1

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Jamie
Hi The index log file is attached. Many thanks in advance for your consideration! Jamie Jamie wrote: Wasn't there some index corruption issue with Java 1.6 and Lucene 2.3.2? Could this be the problem? Jamie Jamie wrote: Hi Matthew Thanks in advance for the suggestion. Which file do you

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Jamie
Wasn't there some index corruption issue with Java 1.6 and Lucene 2.3.2? Could this be the problem? Jamie Jamie wrote: Hi Matthew Thanks in advance for the suggestion. Which file do you think does not exist? This is what we have: _15zw.cfs _19od.cfs _1a5d.cfs _1a7n.cfs _1ahf.cfs _1ahh

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Jamie
Hi Matthew Thanks in advance for the suggestion. Which file do you think does not exist? This is what we have: _15zw.cfs _19od.cfs _1a5d.cfs _1a7n.cfs _1ahf.cfs _1ahh.cfs _qzl.cfs segments.gen _1993.cfs _1a0w.cfs _1a7c.cfs _1a9m.cfs _1ahg.cfs _1ahi.cfs segments_158j Aside

Re: Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Matthew Hall
Did you try to open the index using Luke? Luke will be able to tell you whether or not the index is in fact corrupted, but looking at your stack trace, it almost looks like the file... simply isn't there? Matt Jamie wrote: Hi Everyone I am getting the following error when executing Hi

Lucene Search Error: Java.io.IOException: Bad file descriptor

2008-07-23 Thread Jamie
Hi Everyone I am getting the following error when executing Hits hits = searchers.search(query, queryFilter, sort): 18007414-java.io.IOException: Bad file descriptor 18007455- at java.io.RandomAccessFile.seek(Native Method) 18007504- at org.apache.lucene.store.FSDirectory$FSI

Re: What is the percent of size of lucene's index ?

2008-07-23 Thread Matthew Hall
You can also use Luke after you've created your indexes to get their exact size, and other interesting data points. Like Ian said though, the decisions you make on a field by field basis will make your index size vary quite a bit, so probably the best thing you could do is simply try it out, a

Re: What is the percent of size of lucene's index ?

2008-07-23 Thread Ian Lea
I think there are too many variables to give a simple answer. How much of your data are you storing? Indexing? Compressing? Get a representative sample of your data and try it out. -- Ian. On Wed, Jul 23, 2008 at 5:00 PM, Ariel <[EMAIL PROTECTED]> wrote: > I need to know what is the percent
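Once a trial index has been built from that sample, its size is simply the total of the files in the index directory, e.g. (path is a placeholder):

    File dir = new File("/path/to/index");
    long bytes = 0;
    for (File f : dir.listFiles()) {
        bytes += f.length();
    }
    System.out.println("index size: " + (bytes / (1024 * 1024)) + " MB");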

What is the percent of size of lucene's index ?

2008-07-23 Thread Ariel
I need to know what the size of Lucene's index is as a percentage of the information I'm going to index. I have read some articles that say if I index 120 GB of information the index will grow to 40 GB, which would put the ratio at roughly 30%. Could somebody tell me how that can be verified? Is there an

Re: lucene delete by query

2008-07-23 Thread Yonik Seeley
It's in the lucene trunk (current development version). IndexWriter.deleteDocuments(Query query) -Yonik On Wed, Jul 23, 2008 at 9:53 AM, Cam Bazz <[EMAIL PROTECTED]> wrote: > hello, > > was not there a lucene delete by query feature coming up? I remember > something like that, but I could not fin
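A sketch of how that trunk-only method is used (it is not in the 2.3.2 release; the path, field and value below are invented):

    // uses org.apache.lucene.index.*, org.apache.lucene.search.TermQuery,
    // org.apache.lucene.analysis.standard.StandardAnalyzer
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.deleteDocuments(new TermQuery(new Term("status", "expired")));
    writer.close();   // the deletes become visible to readers opened afterwards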

Re: storing the contents of a document in the lucene index

2008-07-23 Thread Erick Erickson
OK, I'm finally catching on. You have to change the demo code to get the contents into something besides an input stream, so you can use one of the alternate forms of the Field constructor. For instance, you could read it all into a string and use the form: doc.add(new Field("content", ,
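Filled out, that form might look like the following, reading the file into a String first so the content can be both stored and tokenized (f and doc are the demo's File and Document; the rest is illustrative):

    StringBuilder sb = new StringBuilder();
    BufferedReader in = new BufferedReader(new FileReader(f));
    for (String line; (line = in.readLine()) != null; ) {
        sb.append(line).append('\n');
    }
    in.close();

    doc.add(new Field("content", sb.toString(),
                      Field.Store.YES,           // keep a retrievable copy in the index
                      Field.Index.TOKENIZED));   // analyze it so it is searchable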

lucene delete by query

2008-07-23 Thread Cam Bazz
hello, was there not a lucene delete by query feature coming up? I remember something like that, but I could not find any references. best regards, -c.b.

Re: Lucene write locks

2008-07-23 Thread Michael McCandless
This looks like a CLASSPATH issue. You need to make sure whatever jar contains that class is in the CLASSPATH when you run that line. Mike Sandeep K wrote: I have another problem also. While parsing the files during indexing I get the following error, java.lang.NoClassDefFoundError: de/s
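Concretely, that means putting whichever jar provides de/sty/io/mimetype/MimeTypeResolver on the classpath next to the Lius and Lucene jars when the indexer is launched, roughly like this (the jar and class names here are placeholders):

    java -cp lucene-core-2.3.2.jar:lius-1.0.jar:<jar-containing-MimeTypeResolver>.jar:. com.example.MyIndexer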

Re: How to avoid duplicate records in lucene

2008-07-23 Thread Erick Erickson
Well, yes, that's expected behavior. Lucene makes no attempt to filter "substantially similar" documents. From Lucene's perspective, you added it twice, you must have had a good reason. And no, it doesn't really display the same document twice. There are two documents in Lucene that happen to have

Re: BooleanQuery$TooManyClauses: maxClauseCount is set to 1024

2008-07-23 Thread Ian Lea
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831 -- Ian. On Wed, Jul 23, 2008 at 12:26 PM, sandyg <[EMAIL PROTECTED]> wrote: > > Hi ALL, > > Please can u help how to overcome the exception > org.apache.lucene.search.BooleanQuery$TooManyClauses: maxCl
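Two common workarounds, sketched (the FAQ entry above has the details; the field name is invented, and PrefixFilter is only present in recent 2.x releases):

    // 1) raise the global clause limit; very broad prefixes then cost more memory and CPU
    BooleanQuery.setMaxClauseCount(10000);

    // 2) avoid rewriting the prefix into a huge BooleanQuery by querying through a filter
    Query q = new ConstantScoreQuery(new PrefixFilter(new Term("num", "123")));
    Hits hits = searcher.search(q);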

BooleanQuery$TooManyClauses: maxClauseCount is set to 1024

2008-07-23 Thread sandyg
Hi ALL, Please can you help me overcome the exception org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024? I got this when using a prefix query like 123* -- View this message in context: http://www.nabble.com/BooleanQuery%24TooManyClauses%3A-maxClauseCount-

Re: Lucene write locks

2008-07-23 Thread Sandeep K
I have another problem also. While parsing the files during indexing I get the following error: java.lang.NoClassDefFoundError: de/sty/io/mimetype/MimeTypeResolver. What is this? I use Lius-1.0. It's giving the error on the line where I have IndexerFactory.getIndexer(file, LiusConfig obj). Please help

Re: Lucene write locks

2008-07-23 Thread Michael McCandless
OK then this should be fine. That single machine, on receiving a JMS message, should use a single IndexWriter for making changes to the index (ie, it should not try to open a 2nd IndexWriter while a previous one is still working on a previous message). Mike Sandeep K wrote: Thanks a
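A rough shape for that on the consumer side, keeping one long-lived writer instead of opening a new one per message (everything below is illustrative; buildDocument is a hypothetical helper):

    // uses javax.jms.*, org.apache.lucene.index.IndexWriter, org.apache.lucene.document.Document
    public class IndexingListener implements MessageListener {
        // a single writer for the whole process; IndexWriter is thread-safe for addDocument
        private final IndexWriter writer;

        public IndexingListener(IndexWriter writer) {
            this.writer = writer;
        }

        public void onMessage(Message message) {
            try {
                Document doc = buildDocument(message);
                writer.addDocument(doc);
            } catch (Exception e) {
                // log and decide whether the message should be redelivered
            }
        }

        private Document buildDocument(Message message) throws Exception {
            Document doc = new Document();
            // ... populate fields from the JMS message ...
            return doc;
        }
    }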

Re: Lucene write locks

2008-07-23 Thread Sandeep K
Thanks a lot Mike, There will be only one machine which uses IndexWriter, and it's the JMS server. This server will first create the file in the physical file system (it's Linux) and then index the saved file. Michael McCandless-2 wrote: > > > Sandeep K wrote: > >> >> Hi all.. >> I had a questi

Re: Lucene write locks

2008-07-23 Thread Michael McCandless
Sandeep K wrote: Hi all.. I had a question related to the write locks created by Lucene. I use Lucene 2.3.2. Will this newer version create locks while indexing like older ones did? Or is there any other way that lucene handles its operations? It still creates write locks, which are used to en

Re: storing the contents of a document in the lucene index

2008-07-23 Thread starz10de
Hi Erick, I don't remove the stop words, as I index parallel corpora which are used for learning translations between pairs of languages, so every word is important. I even developed my own analyzer for Arabic which just removes punctuation and special symbols and returns only Arabic text.