Re: Indexing multiple languages

2005-06-03 Thread Andy Roberts
On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
> For the StandardAnalyzer, will it have to be modified to accept
> different character encodings?
>
> We have customers in China, Taiwan and Hong Kong.  Chinese data may come
> in 3 different encodings:  Big5, GB and UTF8.
>
> What is the default encoding for the StandardAnalyzer?

The analysers themselves do not worry about encodings, per se. Java uses 
Unicode strings throughout, which is sufficient to describe all languages. 
When reading in text files, it's a matter of letting the reader know which 
encoding the file is in; this lets Java read in the text and essentially map 
that encoding to Unicode. All the string operations, such as analysis, are 
then performed on these Unicode strings.

So, the task is making sure the file reader you use to open a document for 
indexing is given the information required to decode your file correctly. 
If you don't specify an encoding, Java will pick a default based on your 
OS locale. For me, that's Latin1, as I'm in Britain. That is clearly 
inadequate for non-Latin texts: Latin1 doesn't cover Chinese characters, so 
such files can't be read correctly with it. You need to specify Big5 
yourself. Read the info on InputStreamReader:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStreamReader.html
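For example, something like this should do it (a minimal sketch; the class
name and path handling are just for illustration):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class Big5FileReader {
    public static Reader open(String path) throws IOException {
        // Decode the file's bytes as Big5; the Reader then yields Unicode
        // characters, which is what the analyzers operate on.
        return new BufferedReader(
            new InputStreamReader(new FileInputStream(path), "Big5"));
    }
}

The resulting Reader can then be handed straight to a field via
Field.Text(name, reader), so the analyzer sees correctly decoded characters.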

Andy

>
> Btw, I did try running the lucene demo (web template) to index the HTML
> files after I added one including English and Chinese characters.  I was
> not able to search for any Chinese in that HTML file (returned no hits).
> I wonder whether I need to change some of the java programs to index
> Chinese and/or accept Chinese as search terms.  I was able to search for
> the HTML file if I used an English word that appeared in the added HTML
> file.
>
> Thanks,
>
> Bob
>
>
> On May 31, 2005, Erik wrote:
>
> Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
> will keep English as-is (removing stop words, lowercasing, and such)
> and will also split CJK characters into individual tokens.
>
>  Erik
>
> On May 31, 2005, at 5:49 PM, jian chen wrote:
> > Hi,
> >
> > Interesting topic. I thought about this as well. I wanted to index
> > Chinese text mixed with English, i.e., I want to treat the English text
> > inside Chinese text as English tokens rather than Chinese text tokens.
> >
> > Right now I think I may have to write a special analyzer that takes the
> > text input and detects whether each character is an ASCII character; if
> > it is, assemble the characters into an English token, and if not, emit
> > the character as a Chinese word token.
> >
> > So, the bottom line is: one analyzer for all the text, with that
> > if/else logic inside the analyzer.
> >
> > I would like to learn more thoughts about this!
> >
> > Thanks,
> >
> > Jian
> >
> > On 5/31/05, Tansley, Robert <[EMAIL PROTECTED]> wrote:
> >> Hi all,
> >>
> >> The DSpace (www.dspace.org) system currently uses Lucene to index
> >> metadata (Dublin Core standard) and extracted full-text content of
> >> documents stored in it.  Now that the system is being used globally,
> >> it needs to support multi-language indexing.
> >>
> >> I've looked through the mailing list archives etc. and it seems it's
> >> easy to plug in analyzers for different languages.
> >>
> >> What if we're trying to index multiple languages in the same site?
> >> Is it best to have:
> >>
> >> 1/ one index for all languages
> >> 2/ one index for all languages, with an extra language field so
> >> searches can be constrained to a particular language
> >> 3/ separate indices for each language?
> >>
> >> I don't fully understand the consequences in terms of performance for
> >> 1/, but I can see that false hits could turn up where one word appears
> >> in different languages (stemming could increase the chances of this).
> >> Also, some languages' analyzers are dramatically different (e.g. the
> >> Chinese one, which just treats every character as a separate
> >> token/word).
> >>
> >> On the other hand, if people are searching for proper nouns in
> >> metadata (e.g. "DSpace") it may be advantageous to search all
> >> languages at once.
> >>
> >>
> >> I'm also not sure of the storage and performance consequences of 2/.
> >>
> >> Approach 3/ seems like it might be the most complex from an
> >> implementation/code point of view.
> >>
> >> Does anyone have any thoughts or recommendations on this?
> >>
> >> Many thanks,
> >>
> >>  Robert Tansley / Digital Media Systems Programme / HP Labs
> >>   http://www.hpl.hp.com/personal/Robert_Tansley/
> >>
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
>
> 

RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Max Pfingsthorn
Hi,

when IndexSearcher.search gives you a Hits object back, all results are already 
sorted by their score, which is computed internally using the Similarity. You 
can access it via Hits.score(n) (see 
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Hits.html). 
This is also shown in the demo in org.apache.lucene.demo.SearchFiles from SVN. 
(see 
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/demo/org/apache/lucene/demo/SearchFiles.java?rev=150739&view=markup).
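For example (sketch only; "path" is the field name the demo uses, and
"searcher" is an open IndexSearcher):

Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    // results come back ordered by descending score
    System.out.println(hits.score(i) + "\t" + hits.doc(i).get("path"));
}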

Hope that helps.
max


-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 21:22
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.


Ok.  So if I get 10 Documents back from a search and I want to get the top 5 
weighted terms for each of the 10 documents, what API call should I use?  I'm 
unable to find the connection between Similarity and a Document.

I know I'm missing the elephant that must be in the middle of the room.  Or 
maybe it's not there.
Is what I'm trying to do do-able?

Thanks,

Andrew

-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 2, 2005 5:33 AM
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

DefaultSimilarity uses exactly this weighting scheme. Makes sense since it's a 
pretty standard relevance measure...

Bye!
max

-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 11:39
To: java-user@lucene.apache.org
Subject: calculate wi = tfi * IDFi for each document.


If I have search results, how can I calculate, using Lucene's API, wi = tfi * 
IDFi for each document?

wi   = term weight
tfi  = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi  = document frequency, i.e. the number of documents containing term i
D    = number of documents in my search result

Thanks,

Andrew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: managing docids for ParallelReader (was Augmenting an existing index)

2005-06-03 Thread Markus Wiederkehr
On 5/31/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > I have wondered about this as well. Are there any *sure fire* ways of
> > creating (and updating) two indices so that doc numbers in one index
> > deliberately correspond to doc numbers in the other index?
> 
> If you add the documents in the same order to both indexes and perform
> the same deletions on both indexes then they'll have the same numbers.
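For concreteness, a minimal sketch of keeping two indexes aligned as Doug
describes (the items array, its accessors, and the field names are all
hypothetical):

IndexWriter main  = new IndexWriter("index-main",  new StandardAnalyzer(), true);
IndexWriter extra = new IndexWriter("index-extra", new StandardAnalyzer(), true);

for (int i = 0; i < items.length; i++) {
    Document a = new Document();
    a.add(Field.Text("contents", items[i].getText()));
    main.addDocument(a);    // becomes document N in index-main

    Document b = new Document();
    b.add(Field.Keyword("id", items[i].getId()));
    extra.addDocument(b);   // becomes the same document N in index-extra
}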

Would it be possible to write an IndexReader that combines two indexes
by a common field, for example a document ID? And how performant would
such an implementation be?

Markus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Erik Hatcher


On Jun 2, 2005, at 9:02 PM, Chris Hostetter wrote:

This is a pretty interesting problem.  I envy you.

I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from "highlighting" in 2D
space.

Based on the XML sample you provided, it looks like your XML files
are already a "tokenized" form of the original OCR data -- by which I
mean the page has already been tokenized into words whose positions are
recorded.

I would parse these XML docs to generate two things:
1) a stream of words for analysis/filtering (ie: stop words, stemming,
   synonyms)
2) a datastructure mapping words to lists of positions (ie: if the
   same word appears in multiple places, list the word once, followed
   by each set of coordinates)

use #1 in the usual way, and add a serialized form of #2 to your index as
a Stored Keyword -- at query time, the words from your initial query can
be looked up in that data structure to find the regions to "highlight"


Chris - that is a great recommendation.  I second it.  The only minor
thing I'll add is that you should probably use an unindexed field for
#2 rather than literally a Field.Keyword - there is no point in indexing
it, as you would never search on that data structure.
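A rough sketch of the indexing side of that (field names and the
serialization helper are hypothetical):

Document doc = new Document();
// #1: the word stream, indexed for searching
doc.add(Field.Text("contents", extractedWords));
// #2: the word -> coordinates map, stored but not indexed
doc.add(Field.UnIndexed("positions", serialize(positionMap)));
writer.addDocument(doc);

At query time, the stored "positions" field is deserialized and each query
term looked up to get the boxes to draw.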


Erik





: I am involved in a project which is trying to provide searching and
: hit highlighting on the scanned image of historical newspapers.  We
: have an XML-based OCR format.  A sample is below.  We need to index
: the CONTENT attribute of the String element, which is the easy part.
: We would like to be able to find the "hits" within this XML document
: in order to use the positioning information to draw the highlight
: boxes on the image.  It doesn't make a lot of sense to just extract
: the CONTENT and index that, because we lose the positioning
: information.  My second thought was to make a custom analyzer which
: dropped everything except for the content element, and then use the
: highlighting class in the sandbox to reanalyze the XML document and
: mark the hits.  With the marked hits in the XML we could find the
: position information and draw on the image.  Has anyone else worked
: with OCR information and Lucene?  What was your approach?  Does this
: approach seem sound?  Any recommendations?

:
: Thanks, Corey
:
:  <TextLine ... VPOS="123644.0">
:    <String ... HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
:    <String ... HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
:    <String ... HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
:    <String ... HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
:    <String ... HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
:    <String ... HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
:  </TextLine>
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-06-03 Thread Erik Hatcher


On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:
> Btw, I did try running the lucene demo (web template) to index the HTML
> files after I added one including English and Chinese characters.  I was
> not able to search for any Chinese in that HTML file (returned no hits).
> I wonder whether I need to change some of the java programs to index
> Chinese and/or accept Chinese as search terms.  I was able to search for
> the HTML file if I used an English word that appeared in the added HTML
> file.


Bob - Andy provided thorough information on the StandardAnalyzer issue
(in short, it deals with Unicode directly, not encodings).  As for the
Lucene demo - you will have to adjust it to read the files in the proper
encoding.  The IndexFiles program indexes files using the default
encoding, which won't be sufficient for your purpose.  The two files to
check are HtmlDocument and FileDocument.  These files read the HTML and
text files that the demo indexes.


Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Grant Ingersoll
I think the TermFreqVector (reader.getTermFreqVector) has the info you want
per document.  You will need to sort it by frequency to get the top
terms in each document.  It doesn't give you the wi, just tfi, but the
whole score is implied by the fact that you have the top 10 documents, I
think.
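Something like this, assuming term vectors were enabled for the field when
indexing ("contents" is a hypothetical field name):

TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
String[] terms = tfv.getTerms();
final int[] freqs = tfv.getTermFrequencies();

// sort term indices by frequency, descending, and print the top 5
Integer[] order = new Integer[terms.length];
for (int i = 0; i < order.length; i++) order[i] = new Integer(i);
java.util.Arrays.sort(order, new java.util.Comparator() {
    public int compare(Object a, Object b) {
        return freqs[((Integer) b).intValue()] - freqs[((Integer) a).intValue()];
    }
});
for (int i = 0; i < 5 && i < order.length; i++) {
    int t = order[i].intValue();
    System.out.println(terms[t] + "\t" + freqs[t]);
}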

-Grant

>>> [EMAIL PROTECTED] 6/2/2005 3:21:35 PM >>>
Ok.  So if I get 10 Documents back from a search and I want to get the
top 5 weighted terms for each of the 10 documents, what API call should I
use?  I'm unable to find the connection between Similarity and a
Document.

I know I'm missing the elephant that must be in the middle of the room.
Or maybe it's not there.
Is what I'm trying to do do-able?

Thanks,

Andrew

-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 2, 2005 5:33 AM
To: java-user@lucene.apache.org 
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

DefaultSimilarity uses exactly this weighting scheme. Makes sense since
it's a pretty standard relevance measure...

Bye!
max

-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 02, 2005 11:39
To: java-user@lucene.apache.org 
Subject: calculate wi = tfi * IDFi for each document.


If I have search results, how can I calculate, using Lucene's API, wi =
tfi * IDFi for each document?

wi   = term weight
tfi  = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi  = document frequency, i.e. the number of documents containing term i
D    = number of documents in my search result

Thanks,

Andrew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-06-03 Thread Grant Ingersoll
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages

>>> [EMAIL PROTECTED] 6/3/2005 6:03:31 AM >>>

On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:
> Btw, I did try running the lucene demo (web template) to index the HTML
> files after I added one including English and Chinese characters.  I was
> not able to search for any Chinese in that HTML file (returned no hits).
> I wonder whether I need to change some of the java programs to index
> Chinese and/or accept Chinese as search terms.  I was able to search for
> the HTML file if I used an English word that appeared in the added HTML
> file.

Bob - Andy provided thorough information on the StandardAnalyzer issue
(in short, it deals with Unicode directly, not encodings).  As for the
Lucene demo - you will have to adjust it to read the files in the proper
encoding.  The IndexFiles program indexes files using the default
encoding, which won't be sufficient for your purpose.  The two files to
check are HtmlDocument and FileDocument.  These files read the HTML and
text files that the demo indexes.

 Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Andrew Boyd
Thanks for bearing with me, Max.

I do understand that the hits come back sorted by descending score after their 
Similarity has been computed relative to the query vector.  What I was hoping 
to do was use the built-in functionality of Lucene to calculate some term 
weights, specifically wi = tfi * IDFi.

Assuming I had Hits, I was hoping to do something like this:

for (int idx = 0; idx < hits.length(); idx++) {
    int id = hits.id(idx);

    TermFreqVector[] termFreqVecs = indexReader.getTermFreqVectors(id);

    // Using the term frequency vectors, calculate the wi for each term
    // in that document.  Similarity.wi() is the (hypothetical) API I was
    // hoping to find:
    for (int j = 0; j < termFreqVecs.length; j++) {
        TermWeight wi = Similarity.wi(termFreqVecs[j]);
        ...
    }
}

Andrew


-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 3, 2005 4:13 AM
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

when IndexSearcher.search gives you a Hits object back, all results are already 
sorted by their score, which is computed internally using the Similarity. You 
can access it via Hits.score(n) (see 
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Hits.html). 
This is also shown in the demo in org.apache.lucene.demo.SearchFiles from SVN. 
(see 
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/demo/org/apache/lucene/demo/SearchFiles.java?rev=150739&view=markup).

Hope that helps.
max


-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 21:22
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.


Ok.  So if I get 10 Documents back from a search and I want to get the top 5 
weighted terms for each of the 10 documents what API call should I use?  I'm 
unable to find the connection between Similarity and a Document.

I know I'm missing the elephant that must be in the middle of the room.  Or 
maybe it's not there.
Is what I'm trying to do do-able?

Thanks,

Andrew

-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 2, 2005 5:33 AM
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

DefaultSimilarity uses exactly this weighting scheme. Makes sense since it's a 
pretty standard relevance measure...

Bye!
max

-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 11:39
To: java-user@lucene.apache.org
Subject: calculate wi = tfi * IDFi for each document.


If I have search results, how can I calculate, using Lucene's API, wi =
tfi * IDFi for each document?

wi   = term weight
tfi  = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi  = document frequency, i.e. the number of documents containing term i
D    = number of documents in my search result

Thanks,

Andrew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Andrew Boyd
Software Architect
Sun Certified J2EE Architect
B&B Technical Services Inc.
205.422.2557

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Andrew Boyd
Thanks for the reply.  It looks like I can use parts of Similarity.

I'll post back once I get it working or at least closer ;-)

Andrew

-Original Message-
From: Grant Ingersoll <[EMAIL PROTECTED]>
Sent: Jun 3, 2005 6:51 AM
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.

I think the TermFreqVector (reader.getTermFreqVector) has the info you want
per document.  You will need to sort it by frequency to get the top
terms in each document.  It doesn't give you the wi, just tfi, but the
whole score is implied by the fact that you have the top 10 documents, I
think.

-Grant

>>> [EMAIL PROTECTED] 6/2/2005 3:21:35 PM >>>
Ok.  So if I get 10 Documents back from a search and I want to get the
top 5 weighted terms for each of the 10 documents, what API call should I
use?  I'm unable to find the connection between Similarity and a
Document.

I know I'm missing the elephant that must be in the middle of the room.
Or maybe it's not there.
Is what I'm trying to do do-able?

Thanks,

Andrew

-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 2, 2005 5:33 AM
To: java-user@lucene.apache.org 
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

DefaultSimilarity uses exactly this weighting scheme. Makes sense since
it's a pretty standard relevance measure...

Bye!
max

-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 02, 2005 11:39
To: java-user@lucene.apache.org 
Subject: calculate wi = tfi * IDFi for each document.


If I have search results, how can I calculate, using Lucene's API, wi =
tfi * IDFi for each document?

wi   = term weight
tfi  = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi  = document frequency, i.e. the number of documents containing term i
D    = number of documents in my search result

Thanks,

Andrew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Andrew Boyd
Software Architect
Sun Certified J2EE Architect
B&B Technical Services Inc.
205.422.2557

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Corey Keith
With this approach all work is done at the word level.  When we have a phrase 
query, the results will contain pages with the entire phrase, but when we go to 
highlight the document, _all_ occurrences of the words in the phrase will be 
highlighted, regardless of whether they appear as part of the phrase.  Is that 
correct?  It would also be difficult to get the best fragment in a way similar 
to the current highlighter?

>>> [EMAIL PROTECTED] 06/02/05 9:02 PM >>>

This is a pretty interesting problem.  I envy you.

I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from "highlighting" in 2D
space.

Based on the XML sample you provided, it looks like your XML files
are already a "tokenized" form of the original OCR data -- by which I
mean the page has already been tokenized into words whose positions are
recorded.

I would parse these XML docs to generate two things:
1) a stream of words for analysis/filtering (ie: stop words, stemming,
   synonyms)
2) a datastructure mapping words to lists of positions (ie: if the
   same word appears in multiple places, list the word once, followed
   by each set of coordinates)

use #1 in the usual way, and add a serialized form of #2 to your index as
a Stored Keyword -- at query time, the words from your initial query can
be looked up in that data structure to find the regions to "highlight"



: I am involved in a project which is trying to provide searching and
: hit highlighting on the scanned image of historical newspapers.  We
: have an XML-based OCR format.  A sample is below.  We need to index
: the CONTENT attribute of the String element, which is the easy part.
: We would like to be able to find the "hits" within this XML document
: in order to use the positioning information to draw the highlight
: boxes on the image.  It doesn't make a lot of sense to just extract
: the CONTENT and index that, because we lose the positioning
: information.  My second thought was to make a custom analyzer which
: dropped everything except for the content element, and then use the
: highlighting class in the sandbox to reanalyze the XML document and
: mark the hits.  With the marked hits in the XML we could find the
: position information and draw on the image.  Has anyone else worked
: with OCR information and Lucene?  What was your approach?  Does this
: approach seem sound?  Any recommendations?
:
: Thanks, Corey
:
:  <TextLine ... VPOS="123644.0">
:    <String ... HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
:    <String ... HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
:    <String ... HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
:    <String ... HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
:    <String ... HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
:    <String ... HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
:  </TextLine>
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Richard Krenek
Corey,
  I have one off the wall approach that may or may not work for you.
If you convert your scanned images to PDF then use something like
Acrobat to convert those PDFs into PDFs with hidden text (The OCR
data). You can then tell Acrobat Reader via XML what to highlight when
your user opens the PDF.
  Not sure if that helps you but may give you some alternate ideas.

Richard


On 6/2/05, Corey Keith <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I am involved in a project which is trying to provide searching and
> hit highlighting on the scanned image of historical newspapers.  We
> have an XML-based OCR format.  A sample is below.  We need to index
> the CONTENT attribute of the String element, which is the easy part.
> We would like to be able to find the "hits" within this XML document
> in order to use the positioning information to draw the highlight
> boxes on the image.  It doesn't make a lot of sense to just extract
> the CONTENT and index that, because we lose the positioning
> information.  My second thought was to make a custom analyzer which
> dropped everything except for the content element, and then use the
> highlighting class in the sandbox to reanalyze the XML document and
> mark the hits.  With the marked hits in the XML we could find the
> position information and draw on the image.  Has anyone else worked
> with OCR information and Lucene?  What was your approach?  Does this
> approach seem sound?  Any recommendations?
> 
> Thanks, Corey
> 
>  <TextLine ... VPOS="123644.0">
>    <String ... HPOS="1316.0" VPOS="123644.0" CONTENT="The" WC="1.0"/>
>    <String ... HPOS="1664.0" VPOS="123711.0" CONTENT="female" WC="1.0"/>
>    <String ... HPOS="2192.0" VPOS="123711.0" CONTENT="lays" WC="1.0"/>
>    <String ... HPOS="2528.0" VPOS="123711.0" CONTENT="about" WC="1.0"/>
>    <String ... HPOS="3000.0" VPOS="123770.0" CONTENT="140" WC="1.0"/>
>    <String ... HPOS="3316.0" VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
>  </TextLine>
> 
> 
> 
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-06-03 Thread Paul Libbrecht

Robert,

On 2 June 05, at 21:42, Tansley, Robert wrote:

> It seems that there are even more options --
> 4/ One index, with a separate Lucene document for each (item,language)
> combination, with one field that specifies the language
> 5/ One index, one Lucene document per item, with field names that
> include the language (e.g. title_en, title_cn)
> I quite like 4, because you can search with no language constraint, or
> with one as Paul suggests below.


You can in both cases. In the second, you need to expand the query (i.e. 
searching for carrot would search text_en:carrot OR text_cn:carrot), which, 
I think, is fair as long as you don't have a two-kilometre list of 
languages.
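For example, expanding a single-term query across two language fields might
look like this (a sketch; the field names are the ones from option 5):

BooleanQuery q = new BooleanQuery();
// neither clause required nor prohibited: match either field
q.add(new TermQuery(new Term("text_en", "carrot")), false, false);
q.add(new TermQuery(new Term("text_cn", "carrot")), false, false);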


> However, some "non language-specific" data might need to be repeated
> (e.g. dates), unless we had an extra Lucene document for all that.  I
> wonder what the various pros and cons in terms of index size and
> performance would be in each case?  I really don't have enough
> knowledge of Lucene to have any idea...


If you separate the indices you won't, as far as I know, be able to 
query them simultaneously (e.g. for some text which, at the same time, is 
new enough).


paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Max Pfingsthorn
Aha :)

So you want to do blind relevance feedback?
I guess the term vectors will be the way to go then. Otherwise, I don't know 
how to access the terms of a document. And: are you sure you need the TF.IDF 
weights for each term? Maybe it would be enough to just use TF for sorting, 
as that is already present in the term vector. In any case, Similarity knows 
how to compute IDF for a term.
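A sketch along those lines, computing wi from a document's term vector and
docFreq (field name hypothetical; note this IDF is over the whole index,
not just the result set as in Andrew's definition of D):

for (int i = 0; i < hits.length(); i++) {
    TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), "contents");
    if (tfv == null) continue;  // term vectors must be enabled at indexing time
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int j = 0; j < terms.length; j++) {
        int dfi = reader.docFreq(new Term("contents", terms[j]));
        // wi = tfi * IDFi, with IDFi = log(D / dfi)
        double wi = freqs[j] * Math.log(reader.numDocs() / (double) dfi);
        System.out.println(terms[j] + "\twi=" + wi);
    }
}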

Bye!
max

-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Friday, June 03, 2005 14:00
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.


Thanks for bearing with me, Max.

I do understand that the hits come back sorted by descending score after their 
Similarity has been computed relative to the query vector.  What I was hoping 
to do was use the built-in functionality of Lucene to calculate some term 
weights, specifically wi = tfi * IDFi.

Assuming I had Hits, I was hoping to do something like this:

for (int idx = 0; idx < hits.length(); idx++) {
    int id = hits.id(idx);

    TermFreqVector[] termFreqVecs = indexReader.getTermFreqVectors(id);

    // Using the term frequency vectors, calculate the wi for each term
    // in that document.  Similarity.wi() is the (hypothetical) API I was
    // hoping to find:
    for (int j = 0; j < termFreqVecs.length; j++) {
        TermWeight wi = Similarity.wi(termFreqVecs[j]);
        ...
    }
}

Andrew


-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 3, 2005 4:13 AM
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

when IndexSearcher.search gives you a Hits object back, all results are already 
sorted by their score, which is computed internally using the Similarity. You 
can access it via Hits.score(n) (see 
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Hits.html). 
This is also shown in the demo in org.apache.lucene.demo.SearchFiles from SVN. 
(see 
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/demo/org/apache/lucene/demo/SearchFiles.java?rev=150739&view=markup).

Hope that helps.
max


-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 21:22
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.


Ok.  So if I get 10 Documents back from a search and I want to get the top 5 
weighted terms for each of the 10 documents what API call should I use?  I'm 
unable to find the connection between Similarity and a Document.

I know I'm missing the elephant that must be in the middle of the room.  Or 
maybe it's not there.
Is what I'm trying to do do-able?

Thanks,

Andrew

-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 2, 2005 5:33 AM
To: java-user@lucene.apache.org
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

DefaultSimilarity uses exactly this weighting scheme. Makes sense since it's a 
pretty standard relevance measure...

Bye!
max

-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 11:39
To: java-user@lucene.apache.org
Subject: calculate wi = tfi * IDFi for each document.


If I have search results, how can I calculate, using Lucene's API, wi =
tfi * IDFi for each document?

wi   = term weight
tfi  = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi  = document frequency, i.e. the number of documents containing term i
D    = number of documents in my search result

Thanks,

Andrew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Andrew Boyd
Software Architect
Sun Certified J2EE Architect
B&B Technical Services Inc.
205.422.2557

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: calculate wi = tfi * IDFi for each document.

2005-06-03 Thread Grant Ingersoll
If you can, I think there has been enough interest in this in the past
that a patch exposing the wi information would probably be useful to
others (not that I am saying it would be committed, as I can't speak for
the committers on the project).

>>> [EMAIL PROTECTED] 6/3/2005 8:19:16 AM >>>
Thanks for the reply.  It looks like I can use parts of Similarity.

I'll post back once I get it working or at least closer ;-)

Andrew

-Original Message-
From: Grant Ingersoll <[EMAIL PROTECTED]>
Sent: Jun 3, 2005 6:51 AM
To: java-user@lucene.apache.org 
Subject: RE: calculate wi = tfi * IDFi for each document.

I think the TermFreqVector (reader.getTermFreqVector) has the info you want
per document.  You will need to sort it by frequency to get the top
terms in each document.  It doesn't give you the wi, just tfi, but the
whole score is implied by the fact that you have the top 10 documents, I
think.

-Grant

>>> [EMAIL PROTECTED] 6/2/2005 3:21:35 PM >>>
Ok.  So if I get 10 Documents back from a search and I want to get the
top 5 weighted terms for each of the 10 documents, what API call should I
use?  I'm unable to find the connection between Similarity and a
Document.

I know I'm missing the elephant that must be in the middle of the room.
Or maybe it's not there.
Is what I'm trying to do do-able?

Thanks,

Andrew

-Original Message-
From: Max Pfingsthorn <[EMAIL PROTECTED]>
Sent: Jun 2, 2005 5:33 AM
To: java-user@lucene.apache.org 
Subject: RE: calculate wi = tfi * IDFi for each document.

Hi,

DefaultSimilarity uses exactly this weighting scheme. Makes sense
since
it's a pretty standard relevance measure...

Bye!
max

-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 02, 2005 11:39
To: java-user@lucene.apache.org 
Subject: calculate wi = tfi * IDFi for each document.


If I have search results, how can I calculate, using Lucene's API, wi =
tfi * IDFi for each document?

wi   = term weight
tfi  = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi  = document frequency, i.e. the number of documents containing term i
D    = number of documents in my search result

Thanks,

Andrew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Andrew Boyd
Software Architect
Sun Certified J2EE Architect
B&B Technical Services Inc.
205.422.2557

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing multiple languages

2005-06-03 Thread Max Pfingsthorn
Hi

You could use the ParallelReader for this if you have all documents in all 
languages. Then, the metadata fields can be stored in one of the field data 
files, while each language gets its own field data file...

max

-Original Message-
From: Paul Libbrecht [mailto:[EMAIL PROTECTED]
Sent: Friday, June 03, 2005 14:23
To: java-user@lucene.apache.org
Subject: Re: Indexing multiple languages


Robert,

On 2 June 05, at 21:42, Tansley, Robert wrote:
> It seems that there are even more options --
> 4/ One index, with a separate Lucene document for each (item,language) 
> combination, with one field that specifies the language
> 5/ One index, one Lucene document per item, with field names that 
> include the language (e.g. title_en, title_cn)
> I quite like 4, because you can search with no language constraint, or 
> with one as Paul suggests below.

You can in both cases. In the second, you need to expand the query (i.e. 
searching for carrot would search text_en:carrot OR text_cn:carrot), which, 
I think, is fair as long as you don't have a two-kilometre list of 
languages.

> However, some "non language-specific" data might need to be repeated 
> (e.g. dates), unless we had an extra Lucene document for all that.  I 
> wonder what the various pros and cons in terms of index size and 
> performance would be in each case?  I really don't have enough 
> knowledge of Lucene to have any idea...

If you separate the indices you won't, as far as I know, be able to 
query them simultaneously (e.g. for some text which, at the same time, is 
new enough).

paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: managing docids for ParallelReader

2005-06-03 Thread Sebastian Marius Kirsch
Hi Doug,

I took up your suggestion to use a ParallelReader for adding more
fields to existing documents. I now have two indexes with the same
number of documents, but different fields. One field is duplicated
(the id field).

I wrote a small class to merge those two indexes into one index; it is
attached to this message. However, when I run this class in order to
merge the two indexes, I get a NullPointerException:

Exception in thread "main" java.lang.NullPointerException
at 
org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318)
at 
org.apache.lucene.index.ParallelReader$ParallelTermDocs.seek(ParallelReader.java:294)
at 
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
at 
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
at 
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:596)
at 
org.sebastiankirsch.thesis.util.ParallelIndexMergeTool.main(ParallelIndexMergeTool.java:27)

I'm afraid that this is my first journey into the bowels of Lucene,
and I don't know where to look for sources of the problem. I tried
removing the duplicate field, but the symptoms stay the same. Does
this mean that I cannot merge two indexes from a ParallelReader into
one normal index? Or is it a problem with my index? Or a problem
somewhere else?

I am using revision 179785 from the svn repo.

Thanks very much for your time, Sebastian


import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ParallelReader;

public class ParallelIndexMergeTool {
    public static void main(String[] args) throws IOException {
        // the merged index is created at args[0]; the indexes to merge follow
        IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), true);
        ParallelReader reader = new ParallelReader();

        for (int i = 1; i < args.length; i++) {
            reader.add(IndexReader.open(args[i]));
        }

        writer.addIndexes(new IndexReader[] { reader });
        writer.optimize();
        writer.close();
    }
}

-- 
Sebastian Kirsch <[EMAIL PROTECTED]> [http://www.sebastian-kirsch.org/]

NOTE: New email address! Please update your address book.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Erik Hatcher


On Jun 3, 2005, at 8:50 AM, Corey Keith wrote:

With this approach all work is done at the word level.  When we have a
phrase query, the results will contain pages with the entire phrase, but
when we go to highlight the document, _all_ occurrences of the words in
the phrase will be highlighted, regardless of whether they appear as part
of the phrase.  Is that correct?  It would also be difficult to get the
best fragment in a way similar to the current highlighter?


The current Highlighter also does it by Term even if the query is a
PhraseQuery - so you're not losing capability by not using Highlighter
in this case.


Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-06-03 Thread Doug Cutting

Tansley, Robert wrote:

What if we're trying to index multiple languages in the same site?  Is
it best to have:

1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?


I'd use 2/.  In particular, use the same field for the content, title, 
etc., even when produced by different analyzers.  Have a "lang" field 
that names the language of the document.


At query time, use an analyzer selected by the user's environment (e.g., 
HTTP lang header).  If folks are getting false positives, where a term 
in another language that means something different is matching their 
query, they can use a "lang" pulldown to remove documents from other 
languages, implemented as a Lucene Filter.
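For example (a sketch; the field and language values are hypothetical, and
"searcher" is an open IndexSearcher):

// restrict a search to documents whose "lang" field is "en"
Filter langFilter = new QueryFilter(new TermQuery(new Term("lang", "en")));
Hits hits = searcher.search(query, langFilter);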


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: managing docids for ParallelReader

2005-06-03 Thread Doug Cutting

Sebastian Marius Kirsch wrote:

I took up your suggestion to use a ParallelReader for adding more
fields to existing documents. I now have two indexes with the same
number of documents, but different fields.


Does search work using the ParallelReader?


One field is duplicated
(the id field).


Why is this duplicated?  Just curious.  That shouldn't cause a problem.


I wrote a small class to merge those two indexes into one index; it is
attached to this message. However, when I run this class in order to
merge the two indexes, I get a NullPointerException:


Why are you merging?  Why not just search using the ParallelReader? 
Again, just curious.  This should work.



Exception in thread "main" java.lang.NullPointerException
at 
org.apache.lucene.index.ParallelReader$ParallelTermPositions.seek(ParallelReader.java:318)
at 
org.apache.lucene.index.ParallelReader$ParallelTermDocs.seek(ParallelReader.java:294)
at 
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
at 
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
at 
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:596)
at 
org.sebastiankirsch.thesis.util.ParallelIndexMergeTool.main(ParallelIndexMergeTool.java:27)


This could be a bug.  I have not tested merging with a ParallelReader. 
Can you please try adding a test case to TestParallelReader that 
demonstrates this?


Thanks,

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



deleting on a keyword field

2005-06-03 Thread Max Pfingsthorn
Hi!

I'm trying to delete a document from the index. Somehow it doesn't work. I made 
a Field.Keyword out of my document's URI and would now like to delete a 
document with a certain uri like so:

reader.delete(new Term(URI_FIELD, uri));

This does not remove anything. Do I have to make the uri a normal field?

Thanks for your help in advance!

Best regards,

Max Pfingsthorn

Hippo  

Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-
[EMAIL PROTECTED] / www.hippo.nl
--

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Preserving original HTML file offsets for highlighting, need HTMLTokenizer?

2005-06-03 Thread Doug Cutting

Fred Toth wrote:

I'm thinking we need something like "HTMLTokenizer" which bridges the
gap between StandardAnalyzer and an external HTML parser. Since so
many of us are dealing with HTML, I would think this would be generally
useful for many problems. It could work this way:

Given this input:

<html><head><title>Howdy there</title></head><body>Hello world</body></html>

An HTMLTokenizer would deliver something like this sort of token stream
(the numbers represent the start/end offsets for the token):

TAG,  <html>,    0,  6
TAG,  <head>,    6, 12
TAG,  <title>,  12, 19
WORD, Howdy,    19, 24
WORD, there,    25, 30
TAG,  </title>, 30, 38
etc.

Given the above, a filter could then strip out the HTML, but pass the
WORDs on to Lucene, preserving the offsets in the source file.  These
would be used later during highlighting.  Clever filters could be
selective about what gets stripped and what gets passed on.


For what it's worth, I think that's a good design and would love to see 
this as a contribution.
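To make the idea concrete, here is a rough sketch of such a tokenizer
(assumptions: well-formed tags and input pre-read into a String; this is
not contribution-quality HTML parsing):

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class HtmlOffsetTokenizer extends TokenStream {
    private final String text;
    private int pos = 0;

    public HtmlOffsetTokenizer(String text) { this.text = text; }

    public Token next() {
        // skip whitespace between tokens
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos >= text.length()) return null;
        int start = pos;
        if (text.charAt(pos) == '<') {
            // TAG token: everything through the closing '>'
            while (pos < text.length() && text.charAt(pos) != '>') pos++;
            if (pos < text.length()) pos++;
            return new Token(text.substring(start, pos), start, pos, "TAG");
        }
        // WORD token: a run of characters up to whitespace or the next tag
        while (pos < text.length() && text.charAt(pos) != '<'
               && !Character.isWhitespace(text.charAt(pos))) pos++;
        return new Token(text.substring(start, pos), start, pos, "WORD");
    }
}

A downstream TokenFilter could then drop the TAG tokens before indexing,
leaving WORD tokens whose offsets still point into the original HTML source.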


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: deleting on a keyword field

2005-06-03 Thread Erik Hatcher


On Jun 3, 2005, at 12:50 PM, Max Pfingsthorn wrote:


Hi!

I'm trying to delete a document from the index. Somehow it doesn't
work. I made a Field.Keyword out of my document's URI and would now
like to delete a document with a certain uri like so:

reader.delete(new Term(URI_FIELD, uri));

This does not remove anything. Do I have to make the uri a normal
field?


Try a search using a TermQuery for that Term - are the document(s)
you expect to be deleted found?  If not, then you'll have to research
how you indexed to ensure the keyword field contains what you expect.


Erik



Re: deleting on a keyword field

2005-06-03 Thread Daniel Naber
On Friday 03 June 2005 18:50, Max Pfingsthorn wrote:

> reader.delete(new Term(URI_FIELD, uri));
>
> This does not remove anything. Do I have to make the uri a normal field?

How do you know nothing was deleted? Are you aware that you need to re-open 
your IndexSearcher/Reader in order to see the changes made to the index?
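For example (a sketch, reusing Max's field and variable names):

IndexReader reader = IndexReader.open("index");
int deleted = reader.delete(new Term(URI_FIELD, uri));
System.out.println("deleted " + deleted + " document(s)");
reader.close();  // commits the deletions

// open a new searcher to see the change
IndexSearcher searcher = new IndexSearcher("index");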

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: deleting on a keyword field

2005-06-03 Thread Ernesto De Santis

from javadoc:

public final int delete(Term term) throws IOException

Deletes all documents containing term. This is useful if one uses a 
document field to hold a unique ID string for the document. Then to 
delete such a document, one merely constructs a term with the 
appropriate field and the unique ID string as its text and passes it to 
this method. Returns the number of documents deleted.


bye
Ernesto.


Daniel Naber wrote:

> On Friday 03 June 2005 18:50, Max Pfingsthorn wrote:
>
>> reader.delete(new Term(URI_FIELD, uri));
>>
>> This does not remove anything. Do I have to make the uri a normal field?
>
> How do you know nothing was deleted? Are you aware that you need to re-open
> your IndexSearcher/Reader in order to see the changes made to the index?
>
> Regards
>  Daniel



--
Ernesto De Santis - Colaborativa.net
La Plata, Argentina.
http://www.colaborativa.net/






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Indexing multiple languages

2005-06-03 Thread Bruce Ritchie
> Tansley, Robert wrote:
> > What if we're trying to index multiple languages in the same site?  Is
> > it best to have:
> >
> > 1/ one index for all languages
> > 2/ one index for all languages, with an extra language field so
> > searches can be constrained to a particular language
> > 3/ separate indices for each language?
>
> I'd use 2/.  In particular, use the same field for the content,
> title, etc., even when produced by different analyzers.  Have a
> "lang" field that names the language of the document.

We use 2/ and use filters when we want to search only within a particular 
language. Just be sure to use the same analyzer when indexing and 
searching within a particular language.


Regards,

Bruce Ritchie

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]