Lucene and XML Architecture

2007-07-19 Thread Thomas
Hi all, As part of my diploma thesis I'm starting to work on an information retrieval solution for a law and business publisher. Currently I'm trying to define a flexible and scalable architecture. All the data is available in XML form and at the moment is simply stored on the file system and

RE: Lucene shows parts of search query as a HIT

2007-07-19 Thread Ard Schrijvers
Hello Askar, Which analyzer are you using for indexing and searching? If you use an analyzer that applies stemming, you might see that change, changing, changed, chan, etc. all get reduced to the same word, chan. In Luke you can test with plugins that show you what tokens are created from your

RE: Inrease the performance of Indexing in Lucene

2007-07-19 Thread Ard Schrijvers
Hello, Did you take a look at Nutch, Hadoop or Solr? They seem to partially address the things you describe. About LSI, I am not sure what has been done in those projects. Regards, Ard Hi, Please help me. It's been a month since I started trying Lucene. My requirements are huge, I have to

Re: Lucene and XML Architecture

2007-07-19 Thread Patrick Turcotte
Hi, There is a Lucene-eXist trigger that allows you to do just that. Take a look at the patch at http://sourceforge.net/tracker/index.php?func=detail&aid=1654205&group_id=17691&atid=317691 Then, from eXist, you can search with either XQuery or Lucene syntax. Patrick Thomas wrote: My intention is to

Re: Inrease the performance of Indexing in Lucene

2007-07-19 Thread miztaken
But will it be possible to rename a Field inside a Lucene Document? I know it's not possible to change the value of a Document's Field, but can we change the field's name? Any ideas... I am totally petrified of googling.

question about flush(), optimize(), and deleted documents

2007-07-19 Thread Donna L Gresh
I have run into problems with an error that I am trying to access a deleted document when doing something along the lines below; my brief question is, what is necessary to avoid seeing deleted documents? Is an optimize() necessary? Or will a flush() or close() accomplish the same thing?

Re: question about flush(), optimize(), and deleted documents

2007-07-19 Thread Mark Miller
All deletes should be removed after an optimize. Otherwise, I think you'll have to call isDeleted before trying to access the document. numDocs does not include deletes, but the document() call will retrieve deletes. You might try using maxDoc() instead of numDocs(). - Mark
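
The numDocs()/maxDoc() distinction in the reply above can be sketched without the Lucene API. This is only a rough illustration under stated assumptions: the BitSet is a hypothetical stand-in for the index's deletion flags, and the class and method names are invented here. In Lucene 2.x the corresponding calls would be IndexReader.maxDoc() for the loop bound and IndexReader.isDeleted(int) for the check before calling document(int).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveDocIds {
    /**
     * Returns the document numbers that are safe to load.
     * maxDoc is one greater than the largest document number, so it still
     * counts the slots of deleted documents; "deleted" is a stand-in for
     * per-document isDeleted checks (an assumption, not Lucene code).
     */
    public static List<Integer> liveDocIds(int maxDoc, BitSet deleted) {
        List<Integer> live = new ArrayList<Integer>();
        // Loop up to maxDoc, not the live-document count: document numbers
        // keep their gaps until the index is optimized.
        for (int id = 0; id < maxDoc; id++) {
            if (deleted.get(id)) {
                continue; // skip deleted slots instead of trying to load them
            }
            live.add(id);
        }
        return live;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(1); // pretend document 1 was deleted
        System.out.println(liveDocIds(4, deleted));
    }
}
```

Looping only up to the live-document count would both stop before the last document and still visit the deleted slot, which matches the error described in the question.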

Re: Lucene shows parts of search query as a HIT

2007-07-19 Thread Erick Erickson
You say there's only one document and you added many. The line IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true); blows away any existing index data and starts over. If you're calling this fragment for each document, you'll always have only one doc. Try changing the

Re: Inrease the performance of Indexing in Lucene

2007-07-19 Thread Erick Erickson
Well, instead of googling, look at the Lucene searchable archive. It's linked to from the Lucene home page. I have no clue whether what you want is already in there, but there is a wealth of info there. Erick On 7/19/07, miztaken [EMAIL PROTECTED] wrote: But will it be possible to rename

Re: question about flush(), optimize(), and deleted documents

2007-07-19 Thread Donna L Gresh
Thank you-- this is what appeared to be the case, but I wanted to check if there was something simple I wasn't understanding-- All deletes should be removed after an optimize. Otherwise, I think you'll have to call isDeleted before trying to access the document. numDocs does not include

Re: Lucene shows parts of search query as a HIT

2007-07-19 Thread Askar Zaidi
Yes, I realized that. Now I have all the documents in the index. I'll play around with Luke to see what can stop stemming. Thanks! AZ On 7/19/07, Erick Erickson [EMAIL PROTECTED] wrote: You say there's only one document and you added many. The line IndexWriter writer = new

Where exact score is getting calculate?

2007-07-19 Thread Bhavin Pandya
Hi, The score I am getting in DocCollector is the raw score, which is not necessarily between 0 and 1. Where exactly does Lucene calculate the final score? And what if I want the final score in DocCollector? How can I do that? Regards, Bhavin Pandya

Lucene newbie

2007-07-19 Thread Yom Chouloute
Hello all, I am working on a couple of projects that require some search engine capabilities. I came across Lucene and I think that it might be a good tool to incorporate into the project. I started implementing the software but got some error messages that prevent me from going further.

Re: Where exact score is getting calculate?

2007-07-19 Thread Erick Erickson
I don't think you can when using a HitCollector. If you use TopDocs instead, you have access to the maximum score and can normalize the scores to between 0 and 1, but I don't know if that suits your needs. Erick On 7/19/07, Bhavin Pandya [EMAIL PROTECTED] wrote: Hi, The score I am getting in
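
The normalization suggested above can be sketched in plain Java. This is a minimal sketch, not the Lucene implementation: the class and method names are hypothetical, and the rule of scaling only when the maximum score exceeds 1.0 is an assumption modeled on how Hits in Lucene 2.x appears to behave. The maximum score would come from TopDocs (getMaxScore() in 2.x).

```java
public class ScoreNormalizer {
    /**
     * Scales raw scores so that none exceeds 1.0.
     * maxScore is the highest raw score of the result set. Scores are only
     * scaled down when maxScore is above 1.0, so they are never inflated
     * (an assumption about the desired behavior).
     */
    public static float[] normalize(float[] rawScores, float maxScore) {
        float norm = (maxScore > 1.0f) ? 1.0f / maxScore : 1.0f;
        float[] normalized = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] * norm;
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] scores = normalize(new float[] {4.0f, 2.0f, 1.0f}, 4.0f);
        for (float s : scores) {
            System.out.println(s);
        }
    }
}
```

Because the factor depends on the best hit of each query, normalized scores from different queries are not comparable with one another.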

distinct query how to???

2007-07-19 Thread Bhavin Pandya
Hi Erick, Thanks for your prompt reply. Let me explain what I am doing. There is a Lucene query which returns relevant results when I search through the Hits object. But when I run the same query using DocCollector ( I want it this way because I want to remove duplicate records at search time

Re: Lucene shows parts of search query as a HIT

2007-07-19 Thread Askar Zaidi
Hey Erick, How can I change the default Lucene OR property to AND? When I tried query.toString(), I got contents:w contents:chan contents: kim That's fine, but it's doing OR. How can I make it AND so that it shows: contents: W Chan Kim ?? Thanks a ton! AZ On 7/19/07, Erick Erickson [EMAIL

Re: Lucene shows parts of search query as a HIT

2007-07-19 Thread Erick Erickson
QueryParser.setDefaultOperator On 7/19/07, Askar Zaidi [EMAIL PROTECTED] wrote: Hey Erik, How can I change the default Lucene OR property to AND. When I tried query.toString(), I got contents:w contents:chan contents: kim Thats fine, but its doing OR, how can I make it AND so that it
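
What changing the default operator means for the query in this thread can be illustrated without Lucene. The sketch below is only a model: documents are reduced to sets of terms, and the class and method names are invented for illustration. In Lucene 2.x the actual switch would, as far as I know, be a call such as parser.setDefaultOperator(QueryParser.AND_OPERATOR) on the QueryParser instance before parsing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DefaultOperatorDemo {
    /** OR semantics (the default): at least one query term must occur in the document. */
    public static boolean matchesAny(Set<String> docTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** AND semantics: every query term must occur in the document. */
    public static boolean matchesAll(Set<String> docTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (!docTerms.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The three terms the parsed query in the thread was split into.
        List<String> query = Arrays.asList("w", "chan", "kim");
        // A hypothetical document that contains only one of them.
        Set<String> doc = new HashSet<String>(Arrays.asList("chan", "lee"));
        System.out.println("OR match:  " + matchesAny(doc, query));
        System.out.println("AND match: " + matchesAll(doc, query));
    }
}
```

With OR, a document containing only "chan" is a hit, which is the partial match the thread subject complains about; with AND it is not.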

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-19 Thread Mark Miller
I think it goes without saying that a semi-complex NFA or DFA is going to be quite a bit slower than, say, breaking on whitespace. Not that I am against such a warning. To support my point on writing a custom solution that is more exact towards your needs: If you just remove the NUM

Re: distinct query how to???

2007-07-19 Thread Mark Miller
You get non-relevant results because normally a HitCollector will only collect documents with scores greater than 0. Hits normalizes raw scores like this: if (hitDocs.size() > min) { min = hitDocs.size(); } int n = min * 2; // double # retrieved TopDocs topDocs = (sort ==

Re: Lucene newbie

2007-07-19 Thread karl wettin
On 19 Jul 2007 at 15:48, Yom Chouloute wrote: My time frame at this moment will not allow me to get the full grasp of that software so if anybody on that list would like to do some contract work you can contact me at Hi Yom, you can find human resources at this page:

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-19 Thread Michael Stoppelman
On 7/19/07, Mark Miller [EMAIL PROTECTED] wrote: I think it goes without saying that a semi-complex NFA or DFA is going to be quite a bit slower than say, breaking on whitespace. Not that I am against such a warning. This is true for those very familiar with the code base and the Tokenizer

Re: encoding question.

2007-07-19 Thread Peter Keegan
The source data for my index is already in standard UTF-8 and available as a simple byte array. I need to do some simple tokenization of the data (check for whitespace and special characters that control position increment). What is the most efficient way to index this data and avoid unnecessary

Question about lucene query (+body:12) (+title:12) ?

2007-07-19 Thread li hao cho
Hi all, I use the query (+body:12) (+title:12), but I get the error message below: java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:137) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)

TermFreqVector

2007-07-19 Thread Kevin Chen
I need to use getTermFreqVector on a subset of docs that belong to the hits for a query. I understand I need to pass the docNumber as an argument in this case. How do I access that? For example: doc = hits.doc(0); TermFreqVector vector = reader.getTermFreqVector(docId, field); How do I get

Re: TermFreqVector

2007-07-19 Thread karl wettin
On 19 Jul 2007 at 22:58, Kevin Chen wrote: doc = hits.doc(0); TermFreqVector vector = reader.getTermFreqVector(docId, field); How do I get docId? If you use Hits, it is hits.doc() -- karl

Re: TermFreqVector

2007-07-19 Thread Akanksha Baid
hits.id() should work. karl wettin wrote: On 19 Jul 2007 at 22:58, Kevin Chen wrote: doc = hits.doc(0); TermFreqVector vector = reader.getTermFreqVector(docId, field); How do I get docId? If you use Hits, it is hits.doc()

Re: Question about lucene query (+body:12) (+title:12) ?

2007-07-19 Thread Mark Miller
Hopefully someone will be able to give you some further insight into this. To me, it looks like a corrupted index. If TermVectors were not stored, at worst you should be seeing a NullPointerException. Has this index had anything interesting happen to it? Made with an older version of Lucene,

How to open the term vector storage?

2007-07-19 Thread savageboy
Hello, everyone: doc.add(Field.Unstored(subject, subject, true)); The syntax above is for Lucene 1.4. I need the syntax that does the same thing in Lucene 2.0. Could you help me? Thank you very much!

RE: How to open the term vector storage?

2007-07-19 Thread Jun.Chen
I also have this problem... Field.Text Field.Keyword ... I cannot find these methods in the Lucene 2.0 API -Original Message- From: savageboy [mailto:[EMAIL PROTECTED] Sent: July 20, 2007 9:46 Good morning, Daniel To: java-user@lucene.apache.org Subject: How to open the term vector storage? Hello,

RE: How to open the term vector storage?

2007-07-19 Thread Chris Hostetter
: I also have this problem... : Field.Text : Field.Keyword : ... : I cannot find this method in lucene2.0 API Please see the FAQ entry "How do I get code written for Lucene 1.4.x to work with Lucene 2.x?": http://wiki.apache.org/lucene-java/LuceneFAQ#head-86d479476c63a2579e867b75d4faa9664ef6cf4d