Hi again

Sorry I didn't include the WorkItem class! Here is the final test case. Apologies!

On 3 Jan 2009, at 14:02, Grant Ingersoll wrote:

You shouldn't need to call close and optimize after each document.

You also don't need the commit if you are going to immediately close.

Also, can you send a standalone test that shows the RTF extraction, the document creation, and the indexing code that demonstrates your issue?

FWIW, and as a complete aside to save you some time once you get this figured out: instead of re-inventing RTF extraction and PDF extraction (as you appear to be doing), have a look at Tika (http://lucene.apache.org/tika).
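Grant's advice above (one writer, many adds, one close; commit() is implied by close()) could be sketched like this against the Lucene 2.4-era API. BatchIndexer and the indexAll signature are hypothetical names, not part of Amin's code:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public final class BatchIndexer {
    // Open one IndexWriter, add every document, and close once at the end.
    // No per-document commit() or optimize(); close() commits pending changes.
    public static void indexAll(Directory dir, Iterable<Document> docs) throws IOException {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            for (Document doc : docs) {
                writer.addDocument(doc);
            }
        } finally {
            writer.close();
        }
    }
}
```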

On Jan 3, 2009, at 8:48 AM, Shashi Kant wrote:

Amin,

Are you calling close() and optimize() after every addDocument()?

I would suggest something like this:

try
{
    while (reader.next()) // e.g. looping through a data reader
    {
        indexWriter.addDocument(document);
    }
}
finally
{
    commitAndOptimise();
}


HTH

Shashi


----- Original Message ----
From: Amin Mohammed-Coleman <ami...@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, January 3, 2009 4:02:52 AM
Subject: Re: Search Problem


Hi again!

I think I may have found the problem but I was wondering if you could verify:

I have the following for my indexer:

public void add(Document document) {
    IndexWriter indexWriter = IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
    try {
        indexWriter.addDocument(document);
        LOGGER.debug("Added Document:" + document + " to index");
        commitAndOptimise(indexWriter);
    } catch (CorruptIndexException e) {
        throw new IllegalStateException(e);
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
}

the commitAndOptimise(indexWriter) looks like this:

private void commitAndOptimise(IndexWriter indexWriter) throws CorruptIndexException, IOException {
    LOGGER.debug("Committing document and closing index writer");
    indexWriter.optimize();
    indexWriter.commit();
    indexWriter.close();
}

It seems that if I comment out optimize() then the Overview tab in Luke for the rtf document looks like this:

5    id    1234
3    body    document
3    body    body
1    body    test
1    body    rtf
1    name    rtfDocumentToIndex.rtf
1    body    new
1    path    rtfDocumentToIndex.rtf
1    summary    This is a
1    type    RTF_INDEXER
1    body    content


This is more like what I expected, although "Amin Mohammed-Coleman" hasn't been stored in the index. Should I not be using indexWriter.optimize()?

I tried using the search function in luke and got the following results:
body:test ---> returns result
body:document ---> no result
body:content ---> no result
body:rtf ----> returns result


Thanks again... sorry to be sending so many emails about this. I am in the process of designing and developing a prototype of a document and domain indexing/searching component, and I would like to demo it to the rest of my team.


Cheers
Amin



On 3 Jan 2009, at 01:23, Erick Erickson wrote:

Well, your query results are consistent with what Luke is
reporting. So I'd go back and test your assumptions. I
suspect that you're not indexing what you think you are.

For your test document, I'd just print out what you're indexing
and the field it's going into, *for each field*. That is, every time you
do a document.add(<field of some kind>), print out that data. I'm
pretty sure you'll find that you're not getting what you expect. For
instance, the call to:

MetaDataEnum.BODY.getDescription()

may be returning some nonsense. Or
bodyText.trim()

isn't doing what you expect.

Lucene is used by many folks, and errors of the magnitude you're
experiencing would be seen by many people and the user list would
be flooded with complaints if it were a Lucene issue at root. That
leaves the code you wrote as the most likely culprit. So try a very simple
test case with lots of debugging println's. I'm pretty sure you'll
find the underlying issue with some of your assumptions pretty quickly.
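Erick's print-every-field suggestion could look something like the following hypothetical debug helper, written against the Lucene 2.4-era Document API (DocumentDumper is an invented name; call it just before addDocument):

```java
import java.util.Iterator;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

public final class DocumentDumper {
    // Print every field's name and stored value so you can see exactly
    // what is about to be indexed.
    public static void dump(Document doc) {
        for (Iterator it = doc.getFields().iterator(); it.hasNext();) {
            Fieldable field = (Fieldable) it.next();
            System.out.println(field.name() + " => [" + field.stringValue() + "]");
        }
    }
}
```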

Sorry I can't be more specific, but we'd have to see all of your code
and the test cases to do that....

Best
Erick

On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:

Hi Erick

Thanks for your reply.

I have used Luke to inspect the document and I am somewhat confused. For
example, when I view the index using the Overview tab of Luke I get the
following:

1       body    test
1       id      1234
1       name    rtfDocumentToIndex.rtf
1       path    rtfDocumentToIndex.rtf
1       summary This is a
1       type    RTF_INDEXER
1       body    rtf


However, when I view the document in the Document tab I get the full text
that was extracted from the rtf document (field: body), which is:

This is a test rtf document that will be indexed.
Amin Mohammed-Coleman

I am using the StandardAnalyzer, therefore I wouldn't expect the words
"document", "indexed", "Amin Mohammed-Coleman" to be removed.

I have referenced the Lucene in Action book and I can't see what I may be doing wrong. I would be happy to provide a test case should it be required.
When adding the body field to the document I am doing:

Document document = new Document();
Field field = new Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),
        Field.Store.YES, Field.Index.ANALYZED);
document.add(field);



When I run the search code, the string "test" is the only word that returns
a result (TopDocs), whereas the others (e.g. "amin", "document", "indexed")
do not.

Thanks again for your help and advice.


Cheers
Amin




On 2 Jan 2009, at 21:20, Erick Erickson wrote:

Casing is usually handled by the analyzer. Since you construct
the term query programmatically, it doesn't go through
any analyzers, thus is not converted into lower case for
searching as was done automatically for you when you
indexed using StandardAnalyzer.

As for why you aren't getting hits, it's unclear to me. But
what I'd do is get a copy of Luke and examine your index
to see what's *really* there. This will often give you clues,
usually pointing to some kind of analyzer behavior that you
weren't expecting.

Best
Erick
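The distinction Erick describes can be seen by building the query through a QueryParser, which does run the query text through the analyzer before constructing the term. A minimal sketch against the 2.4-era API (ParsedQueryDemo is a hypothetical name):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public final class ParsedQueryDemo {
    // Parse query text against the "body" field; StandardAnalyzer
    // lower-cases it, so "Amin" becomes the term "amin". A hand-built
    // TermQuery bypasses this analysis entirely.
    public static Query parse(String text) throws ParseException {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        return parser.parse(text);
    }
}
```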

On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:

Hi

I have tried this and it doesn't work. I don't understand why using "amin"
instead of "Amin" would work; is it not case insensitive?

I tried "test" for field "body" and this works. Any other terms don't work,
for example:

"document"
"indexed"

these are tokens that were extracted when creating the Lucene document.


Thanks for your reply.

Cheers

Amin


On 2 Jan 2009, at 10:36, Chris Lu wrote:

Basically Lucene stores analyzed tokens and looks up matches based on the
tokens. "Amin" after StandardAnalyzer is "amin", so you need to use new
Term("body", "amin") instead of new Term("body", "Amin") to search.
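Chris's point can be checked directly by running a string through the analyzer and collecting the tokens it produces. A sketch assuming the Lucene 2.4-era TokenStream/Token API (AnalyzerDemo is a hypothetical name):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public final class AnalyzerDemo {
    // Return the token strings StandardAnalyzer produces for the text;
    // "Amin" comes out lower-cased as "amin".
    public static List tokens(String text) throws IOException {
        List result = new ArrayList();
        TokenStream stream = new StandardAnalyzer().tokenStream("body", new StringReader(text));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            result.add(token.term());
        }
        return result;
    }
}
```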

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:


Hi


Sorry I was using the StandardAnalyzer in this instance.

Cheers




On 2 Jan 2009, at 00:55, Chris Lu wrote:

You need to let us know the analyzer you are using.

-- Chris Lu

On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:




Hi


I have created an RTFHandler which takes an RTF file and creates a Lucene
Document, which is indexed. The RTFHandler looks something like this:

if (bodyText != null) {
    Document document = new Document();
    Field field = new Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
            Field.Store.YES, Field.Index.ANALYZED);
    document.add(field);
}

I am using Java's built-in RTF text extraction. When I run my test to verify
that the document contains the text I expect, this works fine. I get the
following when I print the document:

Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf document that will be indexed.

Amin Mohammed-Coleman>
stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
stored/uncompressed,indexed<type:RTF_INDEXER>
stored/uncompressed,indexed<summary:This is a >>


The problem is when I use the following to search I get no result:

MultiSearcher multiSearcher = new MultiSearcher(new Searchable[] {rtfIndexSearcher});
Term t = new Term("body", "Amin");
TermQuery termQuery = new TermQuery(t);
TopDocs topDocs = multiSearcher.search(termQuery, 1);
System.out.println(topDocs.totalHits);
multiSearcher.close();

rtfIndexSearcher is configured with the directory that holds rtf documents.
I have used Luke to look at the document, and what I am finding in the
Overview tab is the following for the document:

1       body    test
1       id      1234
1       name    rtfDocumentToIndex.rtf
1       path    rtfDocumentToIndex.rtf
1       summary This is a
1       type    RTF_INDEXER
1       body    rtf


However on the Document tab I am getting (in the body field):

This is a test rtf document that will be indexed.

Amin Mohammed-Coleman


I would expect to get a hit using "Amin" or even "document". I am not sure
whether the line:

TopDocs topDocs = multiSearcher.search(termQuery, 1);

is incorrect, as I am not too sure of the meaning of "Finds the top n hits
for query." for search(Query query, int n) according to the javadocs.

I would be grateful if someone may be able to advise on what I may be doing
wrong. I am using Lucene 2.4.0.


Cheers
Amin









---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org











--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ














