Re: Using RangeFilter

2008-01-24 Thread vivek sar
I have a field indexed as NO_NORMS; does it have to be untokenized to be able to
sort on it?


On Jan 21, 2008 12:47 PM, Antony Bowesman [EMAIL PROTECTED] wrote:
 vivek sar wrote:
  I need to be able to sort on optime as well, thus need to store it .

 Lucene's default sorting does not need the field to be stored, only indexed as
 untokenized.
 Antony











Re: Is Fair Similarity working with lucene 2.2 ?

2008-01-24 Thread Fabrice Robini

Is there anything I can do to make my unit test pass?
Or is it impossible?

Thanks a lot,

Fabrice



Fabrice Robini wrote:
 
 Hi Srikant,
 
 Thank you very much for your reply; it's very interesting.
 I have to say I am confused now...
 I do not know what I can do to make this unit test pass...
 
 I agree with you, it may be an issue of computing relevance.
 
 Fabrice
 
 
 Srikant Jakilinki-3 wrote:
 
 OK, got it to work. Thanks.
 
 By a quick scoring comparison, I got the same scores for both hits.
 Maybe there is a loss of precision somewhere. Or, when scores are equal,
 Lucene is doing something unintended/overlooked and thus putting shorter
 documents higher, as the experiment is a special case where the TF of a
 queried term (for both suites, the TF of "x" is 10%) is equal, which very
 rarely happens. Or maybe the IDF factor is kicking in in some strange way,
 although it shouldn't. There are a number of possible reasons, but to the
 naked eye there isn't much to go on.
 
 However, that said, length normalization is not a science but an art, and
 the simple scheme we have here in the FairSimilarity will not always perform
 as expected in real-world scenarios. Maybe I am missing something
 or have forgotten my basics, but that is not to say your observation is
 trivial.
 
 Rather, the contrary. Hope there will be more activity on this topic 
 because it is an issue of computing relevance which is the core of
 search.
 
 Cheers,
 Srikant
 
 Fabrice Robini wrote:
 Oooops sorry, bad cut/paste...

 Here is the right one :-)

 public void testFairSimilarity() throws CorruptIndexException,
 IOException, ParseException
 {
   Directory theDirectory = new RAMDirectory();
   Analyzer theAnalyzer = new StandardAnalyzer();

   IndexWriter theIndexWriter = new IndexWriter(theDirectory, theAnalyzer);
   theIndexWriter.setSimilarity(new FairSimilarity());

   Document doc1 = new Document();
   Field name1 = new Field(NAME, SHORT_SUITE, Field.Store.YES,
       Field.Index.UN_TOKENIZED);
   Field content1 = new Field(CONTENT, "x 2 3 4 5 6 7 8 9 10",
       Field.Store.NO, Field.Index.TOKENIZED);
   doc1.add(name1);
   doc1.add(content1);
   theIndexWriter.addDocument(doc1);

   Document doc2 = new Document();
   Field name2 = new Field(NAME, BIG_SUITE, Field.Store.YES,
       Field.Index.UN_TOKENIZED);
   Field content2 = new Field(CONTENT,
       "x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20",
       Field.Store.NO, Field.Index.TOKENIZED);
   doc2.add(name2);
   doc2.add(content2);
   theIndexWriter.addDocument(doc2);

   theIndexWriter.close();

   Searcher searcher = new IndexSearcher(theDirectory);
   searcher.setSimilarity(new FairSimilarity());

   QueryParser queryParser = new QueryParser(CONTENT, theAnalyzer);

   Hits hits = searcher.search(queryParser.parse("x"));

   assertEquals(2, hits.length());
   assertEquals(BIG_SUITE, hits.doc(0).get(NAME));
   assertEquals(SHORT_SUITE, hits.doc(1).get(NAME));
 }
 



 Srikant Jakilinki-3 wrote:
   
  Well, I can't seem to even get past the assertions of this code.

 The first assertion is failing in that I get 0 hits. I am using 
 SimpleAnalyzer since I do not have a FrenchAnalyzer.

 Any thoughts?
 Srikant

 
 --
 Free pop3 email with a spam filter.
 http://www.bluebottle.com/tag/5
 
 
 
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Is-Fair-Similarity-working-with-lucene-2.2---tp15001250p15060757.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
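
FairSimilarity itself never appears in this thread, so for readers following
along, here is a minimal sketch of what such a class might look like, assuming
its point is simply to neutralize length normalization via the Lucene 2.x
Similarity API. This is a guess at its intent, not the actual class:

    import org.apache.lucene.search.DefaultSimilarity;

    // Hypothetical reconstruction: return a constant field norm so the number
    // of tokens in a field no longer favors shorter documents.
    public class FairSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
            return 1.0f;
        }
    }

With the length norm fixed at 1.0, ties like the one observed above come down
to the tf and idf factors alone.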





Re: Multiple searchers (Was: CachingWrapperFilter: why cache per IndexReader?)

2008-01-24 Thread Toke Eskildsen
On Thu, 2008-01-24 at 08:18 +1100, Antony Bowesman wrote:
 These are odd.  The last case in both of the above shows a slowdown compared
 to the 2.1 index and version, and in the first 50K queries the 2.3 index and
 version is even slower than 2.3 with the 2.1 index.  It catches up in the
 longer result set.

 Any ideas why that might be?

Looking at the graphs I can see that the 2 threads / shared searcher is
suspiciously fast at getting up to full speed. It could be because the
disk-read-cache wasn't properly flushed. I'll rerun the test.

I've performed an inspection of graphs for my other published
measurements and they looked as expected. I'll spend some more time on
it tomorrow.





Lucene search strings two

2008-01-24 Thread Prathiba Paka
Hi all,
I need to check two conditions in a search:
first, I need to find documents matching a bank name;
next, within those, I need to find documents containing a particular city;
finally, I need the documents which satisfy both conditions,
i.e., documents with bank + city.
Can anyone please help me?

Thanks,
prathiba.P
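
For what it's worth, a minimal sketch of one way to do this, assuming two
indexed fields named "bank" and "city" (the field names and the searcher
variable are assumptions, not something from the original post): build a
BooleanQuery in which both clauses are required.

    TermQuery bank = new TermQuery(new Term("bank", "some bank"));
    TermQuery city = new TermQuery(new Term("city", "some city"));

    BooleanQuery q = new BooleanQuery();
    q.add(bank, BooleanClause.Occur.MUST);  // document must match the bank
    q.add(city, BooleanClause.Occur.MUST);  // and must also match the city

    Hits hits = searcher.search(q);         // only documents with bank AND city

The "Creating search query" thread later in this digest shows the same
MUST/MUST pattern.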


Re: Using RangeFilter

2008-01-24 Thread Antony Bowesman

vivek sar wrote:

I have a field indexed as NO_NORMS; does it have to be untokenized to be able to
sort on it?


NO_NORMS is the same as UNTOKENIZED + omitNorms, so you can sort on that.
Antony
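
For illustration, a minimal sketch of what that looks like in code (Lucene 2.3
API; the field name "optime" comes from the thread, while the value and the
doc/searcher/query variables are assumptions):

    // Index time: untokenized field with norms omitted - still sortable.
    doc.add(new Field("optime", "20080124093000",
            Field.Store.NO, Field.Index.NO_NORMS));

    // Search time: sort results on that field.
    Sort sort = new Sort(new SortField("optime", SortField.STRING));
    Hits hits = searcher.search(query, sort);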






Full Text Searching a Relational Model

2008-01-24 Thread yarong
Hi,

(Warning, not for the weak-hearted)

I'm currently working on a project where we have a large and complex data
model, related to Genomics. We are trying to build a search engine that
provides full text and field-based text searches for our customer base
(mostly academic research), and are evaluating different tools for this
purpose.

As a starting point, we have, as an example, a set of objects (stored in
tables as a relational model):
Gene [ID, Symbol, Description]
Article - M:M with Gene [ID, Title]
Disease - M:M with Gene [ID, Name]
Author - M:M with Article [ID, Name]
(Note: M:M tables exist, just link IDs)

An example model would be (hierarchical, relations dealt with as
duplications)

  Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
Article [ID=1, Title=EGFR mutations in lung cancer: correlation with
clinical response to gefitinib therapy]
  Author [ID=1, Name=H. Michaelson]
  Author [ID=2, Name=J. Watson]
Article [ID=2, Title=Proteomics analysis of epidermal protein kinases
by target class-selective prefractionation and tandem mass
spectrometry]
  Author [ID=1, Name=H. Michaelson]
  Author [ID=3, Name=M. Roberts]
Disease [ID=1, Name=Epidermal sluffing]

  Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine
hydrolase: implications for the three-dimensional structure]
  Author [ID=4, Name=B. Cohen]
  Author [ID=5, Name=L. Alexander]
Article [ID=2, Title=Proteomics analysis of epidermal protein kinases
by target class-selective prefractionation and tandem mass
spectrometry]
  Author [ID=1, Name=H. Michaelson]
  Author [ID=3, Name=M. Roberts]

Note IDs in the objects above, as they relay the relations in the
hierarchical model.

In our Full-Text search, we would like to allow users to search ANY
textual field for any string. For instance, the term epidermal, and
display the list of genes which have any data associated with them with
that term (ranked, of course).
Our list of results would be something like:

EGFR
  Found in Description (epidermal growth factor receptor)
  Found in Article ID#2, in Title (proteomics analysis of epidermal
protein kinases by target class-selective prefractionation and tandem
mass spectrometry)
  Found in Disease ID#1, in Name (Epidermal sluffing)

AHCY
  Found in Article ID#2, in Title (proteomics analysis of epidermal
protein kinases by target class-selective prefractionation and tandem
mass spectrometry)

Note that the results retain a hierarchical view of our Genes (us being
Gene-Centric, we're pretty much framing the question as "find this term in
information related to those genes"). Also note that Article ID #2 has an M:M
with Gene ID 2 (AHCY) and Gene ID 1 (EGFR), and only due to that fact is AHCY
considered a gene that has "epidermal" in its annotations.

Obviously, we'd like to rank fields by location in hierarchy (A term in a
gene name is scored higher than the name of the author of an article
related to a gene) and by number of hits (number of times a term is found
related to that gene, 3 in the case of EGFR above).

Ideas for how to take on this challenge? Implementation? Tools?

Thanks!
Yaron Golan
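
To make the denormalization idea concrete, here is a rough sketch of flattening
one Gene and its related rows into a single Lucene Document; the field names,
boost values, and the writer variable are illustrative assumptions, not a
recommendation from the thread:

    Document doc = new Document();
    doc.add(new Field("geneId", "1", Field.Store.YES, Field.Index.UN_TOKENIZED));

    Field symbol = new Field("symbol", "EGFR", Field.Store.YES, Field.Index.TOKENIZED);
    symbol.setBoost(4.0f);   // terms in the gene name count the most
    doc.add(symbol);

    Field description = new Field("description", "epidermal growth factor receptor",
            Field.Store.YES, Field.Index.TOKENIZED);
    description.setBoost(2.0f);
    doc.add(description);

    // one field instance per related Article / Disease / Author row
    doc.add(new Field("articleTitle",
            "Proteomics analysis of epidermal protein kinases ...",
            Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("diseaseName", "Epidermal sluffing",
            Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("authorName", "H. Michaelson",
            Field.Store.NO, Field.Index.TOKENIZED));

    writer.addDocument(doc);

Searching any of these fields then returns whole genes, and the stored geneId
links each hit back to the relational rows for display.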





LogMergePolicy

2008-01-24 Thread Koji Sekiguchi
Hello,

I'm curious, why is LogMergePolicy named *Log*MergePolicy?
(Why not ExpMergePolicy? :-)

Thank you,

Koji





RE: LogMergePolicy

2008-01-24 Thread Steven Parkes
I'm curious, why is LogMergePolicy named *Log*MergePolicy?
(Why not ExpMergePolicy? :-)

Well, I guess it's a matter of perspective. When you look at the way the
algorithm works, the merge decisions are based on a concept of level and
levels are assigned based on the log of the number of documents in a
segment (going back to Ning's equation). When one is in the code, it's
very natural to think/talk about log-base-merge-factor.

This does result in the number of documents in segments being
order-of-magnitude/exponentially related, so the exponential framing might
have made more sense to users; perhaps it wasn't the best naming decision ...
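
A small worked example of that framing (illustrative arithmetic only, not the
exact code inside LogMergePolicy):

    int mergeFactor = 10;          // the default merge factor
    int docsInSegment = 100000;

    // "level" is roughly log, base mergeFactor, of the segment size, so
    // 1K, 10K and 100K-document segments land on levels 3, 4 and 5.
    double level = Math.log(docsInSegment) / Math.log(mergeFactor);  // 5.0

Segments whose levels fall into the same band are considered together as merge
candidates, which is where the log-base-merge-factor view comes from.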




Re: LogMergePolicy

2008-01-24 Thread Yonik Seeley
On Jan 24, 2008 8:40 AM, Steven Parkes [EMAIL PROTECTED] wrote:
 I'm curious, why is LogMergePolicy named *Log*MergePolicy?
 (Why not ExpMergePolicy? :-)

 Well, I guess it's a matter of perspective. When you look at the way the
 algorithm works, the merge decisions are based on a concept of level and
 levels are assigned based on the log of the number of documents in a
 segment (going back to Ning's equation). When one is in the code, it's
 very natural to think/talk about log-base-merge-factor.

 This does result in the number of documents in segments being
 order-of-magnitude/exponentially related so that might have made more
 sense to users, so perhaps it wasn't the best decision ...

It could accurately be described either way, but there is precedent for
log too... log-normal, for example, is normal after one takes the log
(it could have been called exponential-normal).  I also tend to think
of our number system as logarithmic in nature rather than exponential.

-Yonik




Re: LogMergePolicy

2008-01-24 Thread Koji Sekiguchi

Thank you Steven and Yonik,

I think I got it. And I can see that LogMergePolicy uses
Math.log() to find merges. :-)

Thank you again,

Koji





Re: Full Text Searching a Relational Model

2008-01-24 Thread Chris Lu
In general, you just need to denormalize the data and create a list of
Genes, and add each Gene's related information via SQL queries. Ranking can be
easily adjusted via each field's weight, not a big deal.

Seems an ideal case for using DBSight. It can also do incremental
indexing, which you may also need.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request)
got 2.6 Million Euro funding!


On Jan 24, 2008 5:42 AM,  [EMAIL PROTECTED] wrote:
 Hi,

 (Warning, not for the weak-hearted)

 I'm currently working on a project where we have a large and complex data
 model, related to Genomics. We are trying to build a search engine that
 provides full text and field-based text searches for our customer base
 (mostly academic research), and are evaluating different tools for this
 purpose.

 As a starting point, we have, as an example, a set of objects (stored in
 tables as a relational model):
 Gene [ID, Symbol, Description]
 Article - M:M with Gene [ID, Title]
 Disease - M:M with Gene [ID, Name]
 Author - M:M with Article [ID, Name]
 (Note: M:M tables exist, just link IDs)

 An example model would be (hierarchical, relations dealt with as
 duplications)

   Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
 Article [ID=1, Title=EGFR mutations in lung cancer: correlation with
 clinical response to gefitinib therapy]
   Author [ID=1, Name=H. Michaelson]
   Author [ID=2, Name=J. Watson]
 Article [ID=2, Title=Proteomics analysis of epidermal protein kinases
 by target class-selective prefractionation and tandem mass
 spectrometry]
   Author [ID=1, Name=H. Michaelson]
   Author [ID=3, Name=M. Roberts]
 Disease [ID=1, Name=Epidermal sluffing]

   Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
 Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine
 hydrolase: implications for the three-dimensional structure]
   Author [ID=4, Name=B. Cohen]
   Author [ID=5, Name=L. Alexander]
 Article [ID=2, Title=Proteomics analysis of epidermal protein kinases
 by target class-selective prefractionation and tandem mass
 spectrometry]
   Author [ID=1, Name=H. Michaelson]
   Author [ID=3, Name=M. Roberts]

 Note IDs in the objects above, as they relay the relations in the
 hierarchical model.

 In our Full-Text search, we would like to allow users to search ANY
 textual field for any string. For instance, the term epidermal, and
 display the list of genes which have any data associated with them with
 that term (ranked, of course).
 Our list of results would be something like:

 EGFR
   Found in Description (epidermal growth factor receptor)
   Found in Article ID#2, in Title (proteomics analysis of epidermal
 protein kinases by target class-selective prefractionation and tandem
 mass spectrometry)
   Found in Disease ID#1, in Name (Epidermal sluffing)

 AHCY
   Found in Article ID#2, in Title (proteomics analysis of epidermal
 protein kinases by target class-selective prefractionation and tandem
 mass spectrometry)

 Note that the results retain a hierarchial view of our Genes (us being
 Gene-Centric, we're pretty much framing the question find this term
 related in information related to those genes). Also note that Article ID
 #2 has an M:M with Gene ID2 (AHCY) and Gene ID1 (EGFR), and only due to
 that fact, AHCY is considered a gene that has epidermal in its
 annotations.

 Obviously, we'd like to rank fields by location in hierarchy (A term in a
 gene name is scored higher than the name of the author of an article
 related to a gene) and by number of hits (number of times a term is found
 related to that gene, 3 in the case of EGFR above).

 Ideas for how to take on this challenge? Implementation? Tools?

 Thanks!
 Yaron Golan








Creating search query

2008-01-24 Thread spring
Hi,

I have an index with some fields which are indexed and un_tokenized
(keywords) and one field which is indexed and tokenized (content).

Now I want to create a Query-Object:

TermQuery k1 = new TermQuery(new Term("foo", "some foo"));
TermQuery k2 = new TermQuery(new Term("bar", "some bar"));
QueryParser p = new QueryParser("content",
new SomeAnalyzer()); // same analyzer as used for indexing
Query c = p.parse("text we are looking for");

BooleanQuery q = new BooleanQuery();
q.add(k1, Occur.MUST);
q.add(k2, Occur.MUST);
q.add(c, Occur.MUST);

Is this the best way?

Thank you





RE: Compass

2008-01-24 Thread spring
Thank you. 

 -Original Message-
 From: Lukas Vlcek [mailto:[EMAIL PROTECTED] 
 Sent: Mittwoch, 23. Januar 2008 08:23
 To: java-user@lucene.apache.org
 Subject: Re: Compass
 
 Hi,
 
 I am using Compass with Spring and JPA. It works pretty nicely. I don't
 store the index in a database; I use a traditional file-system-based Lucene
 index. Updates work very well, but you have to be careful about proper
 mapping of your objects into the search engine (especially parent-child
 mappings).
 
 Regards,
 Lukas
 
 On Jan 21, 2008 8:08 PM, [EMAIL PROTECTED] wrote:
 
  Hi,
 
  compass (http://www.opensymphony.com/compass/content/lucene.html)
  promises many nice things, in my opinion.
  Has anybody production experiences with it?
 
  Especially Jdbc Directory and Updates?
 
  Thank you.
 
 
  
 
 
 
 
 -- 
 http://blog.lukas-vlcek.com/
 





Re: Creating search query

2008-01-24 Thread Erick Erickson
That should work fine, assuming that foo and bar are the untokenized
fields and content is the tokenized content.

Erick

On Jan 24, 2008 1:18 PM, [EMAIL PROTECTED] wrote:

 Hi,

 I have an index with some fields which are indexed and un_tokenized
 (keywords) and one field which is indexed and tokenized (content).

 Now I want to create a Query-Object:

 TermQuery k1 = new TermQuery(new Term("foo", "some foo"));
 TermQuery k2 = new TermQuery(new Term("bar", "some bar"));
 QueryParser p = new QueryParser("content",
  new SomeAnalyzer()); // same analyzer as used for indexing
 Query c = p.parse("text we are looking for");

 BooleanQuery q = new BooleanQuery();
 q.add(k1, Occur.MUST);
 q.add(k2, Occur.MUST);
 q.add(c, Occur.MUST);

 Is this the best way?

 Thank you






RE: Creating search query

2008-01-24 Thread spring
Yes, sorry, that's the case.

Thank you! 

 -Original Message-
 From: Erick Erickson [mailto:[EMAIL PROTECTED] 
 Sent: Donnerstag, 24. Januar 2008 19:49
 To: java-user@lucene.apache.org
 Subject: Re: Creating search query
 
 That should work fine, assuming that foo and bar are the untokenized
 fields and content is the tokenized content.
 
 Erick
 
 On Jan 24, 2008 1:18 PM, [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I have an index with some fields which are indexed and un_tokenized
  (keywords) and one field which is indexed and tokenized (content).
 
  Now I want to create a Query-Object:
 
  TermQuery k1 = new TermQuery(new Term("foo", "some foo"));
  TermQuery k2 = new TermQuery(new Term("bar", "some bar"));
  QueryParser p = new QueryParser("content",
   new SomeAnalyzer()); // same analyzer as used for indexing
  Query c = p.parse("text we are looking for");

  BooleanQuery q = new BooleanQuery();
  q.add(k1, Occur.MUST);
  q.add(k2, Occur.MUST);
  q.add(c, Occur.MUST);
 
  Is this the best way?
 
  Thank you
 
 
  
 
 
 





RE: Design questions

2008-01-24 Thread spring
 -Original Message-
 From: Erick Erickson [mailto:[EMAIL PROTECTED] 
 Sent: Freitag, 11. Januar 2008 16:16
 To: java-user@lucene.apache.org
 Subject: Re: Design questions

 But you could also vary this scheme by simply storing in your document
 the offsets for the beginning of each page.

Well, this is the best for my app I think, but...

How do I find out these offsets?

I'm adding the content field with:

IndexWriter#add(new Field("content", myContentReader));

I have no clue how to find out the offsets in this reader. Must be something
involving an analyzer and a TokenStream?

Thank you





Re: Design questions

2008-01-24 Thread Erick Erickson
I think you'll have to implement your own Analyzer and count.
That is, every call to next() that returns a token will have to
also increment some counter by 1.

To use this, you must have some way of knowing when a page
ends, and at that point you call your instance of your custom
analyzer to see what the count is. Or your analyzer maintains
the list and you can call for it after you've added all the pages.

Analyzer.getPositionIncrementGap is called every time you
call document.add(field).

So, you have something like this:
while (more pages for doc) {
   String pagedata = getPageText();
   doc.add("text", pagedata);
}

Under the covers, your custom analyzer adds the current offset
(which you've kept track of) to, say, an ArrayList. And after the
last page is added, you get this arraylist and add it to your
document.

Or, you could just do things twice. That is, send your text through
a TokenStream, then call next() and count. Then send it all
through doc.add().

There are probably cleverer ways, but that should do for a start.

Best
Erick
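
Here is an untested sketch of the counting idea described above, using the
Lucene 2.x Token API; the class name and the way the caller reads the count at
each page boundary are assumptions, not code from the thread:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Wraps the analyzer's real stream and counts tokens as they are emitted,
    // so the caller can record the token offset at each page boundary.
    public class CountingTokenFilter extends TokenFilter {
        private int tokenCount = 0;

        public CountingTokenFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            Token token = input.next();
            if (token != null) {
                tokenCount++;
            }
            return token;
        }

        public int getTokenCount() {
            return tokenCount;
        }
    }

A custom Analyzer would wrap its normal TokenStream in this filter, keep a
reference to the filter instance, and expose getTokenCount() so the indexing
loop can note the offset after each page is added.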

On Jan 24, 2008 2:33 PM, [EMAIL PROTECTED] wrote:

  -Original Message-
  From: Erick Erickson [mailto:[EMAIL PROTECTED]
  Sent: Freitag, 11. Januar 2008 16:16
  To: java-user@lucene.apache.org
  Subject: Re: Design questions

  But you could also vary this scheme by simply storing in your document
  the offsets for the beginning of each page.

 Well, this is the best for my app I think, but...

 How do I find out these offsets?

 I'm adding the content field with:

 IndexWriter#add(new Field(content, myContentReader));

 I have no clue how find out the offsets in this reader. Must be something
 with an analyzer and a TokenStream?

 Thank you






RE: Lucene, HTML and Hebrew

2008-01-24 Thread Itamar Syn-Hershko
Steve and all,

I didn't know whether to send a detailed description of my case to aid with
seeing the whole picture, or to send a list of short questions which will
require loads of follow-up. I guess I know what is better now, thanks

 Lucene does not store proximity relations between data in different
fields, only within individual fields

So, are two calls to doc.add() with the same field name but different texts
considered as one field (the latter call being internally appended to the
former, merged into one field), or as two instances of the same field which
do not share proximity and frequency data?
From what you wrote later in your response, it seems the case is the former.
How can I inhibit this appending -- are there any approaches other than
appending an invalid string like "$$$"?

I've been thinking about this a bit, and I think I'd go with one big field
for all the content, and I'd want to incorporate the headers into it as
well. How would I boost those specific words - so the content field can
contain both all words and all headers in their original order (for
proximity and frequency data to be valid), yet keep the terms which were
originally in a header or a sub-header boosted? This can be a good practice
for boosting bolded or italic text in normal paragraphs as well (only
with a lower boost).

 Generally, stemming increases recall (proportion of matching relevant
docs among relevant docs in the entire corpus), and decreases precision
(proportion of relevant docs among matching docs).

That’s a great definition, thanks.

I'm trying to think this through, since Hebrew is not a regular case. If you
google for Hebrew and stemming you will get pages which talk about how
complicated Hebrew is compared to English and other European languages.

[ Warning: technical data, questions follow after this paragraph -- to
comply with the 30-seconds rule :) ]
This is extremely difficult since Hebrew has unique features like
Niqqud-less spelling (which causes many words to have several spelling
options, only one legal but the others too common to ignore) and
three-letter stems which have many derivations. Furthermore, English words
like and, that, of, to etc. in Hebrew are represented as one letter appended
to the beginning of the word, forming a whole new word different from the
original. Discarding them while indexing is not a smart move, since one would
look for a specific term *with* this initial and would not expect results
without it. Furthermore, some words which use these initials have another
meaning when pronounced differently (like KLBI - which could be read as
Ke-libi [as my heart], where I can omit the leading K, and also as Kalbi
[my dog], where I cannot).

So, to overcome the challenges above, I was thinking about the query
inflation approach, having a negative boost for the inflated terms as you
suggested. I will appreciate any different takes on this one, as this is
going to be the first public Lucene Hebrew analyzer... Using this approach I
only need to make sure I do not inflate those too much (1024 is the standard
limit, right?).

Also, how can I check whether a word I inflated exists in the index BEFORE
executing the query? Is that recommended at all? -- I'm looking for the most
efficient way, so search speed will still be measured in a few ms, as it is
now. The idea is to prevent, or minimize, the use of a dictionary, and to
keep the stemmer as simple as possible (and thereby produce invalid words
and eliminate them before executing the search).

 It's worth noting, as the above-linked documentation for Field.setBoost()
does, that field boosts are not stored independently of other normalization
factors in the index.

Does this mean I should stick with boosting fields in the query phase only?

Itamar.

-Original Message-
From: Steven A Rowe [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 23, 2008 1:06 AM
To: java-user@lucene.apache.org
Subject: RE: Lucene, HTML and Hebrew

Hi Itamar,

In another thread, you wrote:

 Yesterday I sent an email to this group querying about some very 
 important (to me...) features of Lucene. I'm giving it another chance 
 before it goes unnoticed or forgotten. If it was too long please let 
 me know and I will email a shorter list of questions

I think I have something like a 30-second rule for posts on this list: if I
can't figure out what the question is within 30 seconds, I move on.  Your
post was so verbose that I gave up before I asked myself whether I could
help.  (Déjà vu - upon re-reading this paragraph, it sounds very much like
something Hoss has said on this list...)

Although I answer your original post below, please don't take this as
affirmation of your reminder approach.  In my experience, this strategy is
interpreted as badgering, and tends to affect response rate in the opposite
direction to that intended.

Short, focused questions will maximize the response rate here (and
elsewhere, I suspect).  Also, it helps if there is some 

FYI: parallel corpus in 22 languages

2008-01-24 Thread Andrzej Bialecki

Hi all,

Just FYI, perhaps this is old news for you ... This large corpus is 
freely available and it is pairwise sentence-aligned for all language 
combinations. This looks like a good resource for linguistic 
information, such as frequent words and phrases, n-gram profiles, etc.


http://wt.jrc.it/lt/Acquis/


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





RE: Lucene, HTML and Hebrew

2008-01-24 Thread Steven A Rowe
Hi Itamar,

On 01/24/2008 at 2:55 PM, Itamar Syn-Hershko wrote:
  Lucene does not store proximity relations between data in different
  fields, only within individual fields
 
 So are 2 calls for doc-add with the same field but different
 texts are considered as 1 field (latter call being internally
 appended into the former, merged into one field), or as two
 instances of the same field which do not share proximity and
 frequency data? As it seems from what you wrote later in your
 response, it seems the case is the former.

Yes.  From 
http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/document/Document.html#add(org.apache.lucene.document.Field):

Adds a field to a document. Several fields may be
added with the same name. In this case, if the
fields are indexed, their text is treated as though
appended for the purposes of search.

 How can I inhibit this appending -- are there any more approaches than
 appending an invalid string like $$$?

Here's an idea, though it is entirely untested and may be completely false :) :

Lucene's Tokenizers are fed a Reader (in Java - I don't know about CLucene's 
setup, but I assume the interface is similar) and emit Tokens.  Assuming that 
each field value from same-named fields gets its own Reader, then you could 
create a custom Tokenizer that, for the first Token it emits, sets a position 
increment greater than one - in so doing, phrase matching across same-named 
field values should be inhibited:

http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)
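
As an untested sketch of that idea (Java, Lucene 2.x Token API; the class name
and the gap of 100 positions are arbitrary illustrative choices -- overriding
Analyzer.getPositionIncrementGap() is another common way to get the same
effect):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Bumps the position increment of the first token of a field value, so
    // phrases cannot match across two values of the same field name.
    public class ValueGapFilter extends TokenFilter {
        private boolean first = true;

        public ValueGapFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            Token token = input.next();
            if (token != null && first) {
                token.setPositionIncrement(100);
                first = false;
            }
            return token;
        }
    }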

 I've been thinking about this a bit, and I think I'd go with
 one big field for all the content, and I'd want to incorporate
 the headers into it as well. How would I boost those specific
 words - so the content field can contain both all words and
 all headers in their original order (for proximity and
 frequency data to be valid), yet keep the terms which were
 originally in a header or a sub-header boosted?

Like I wrote in a previous response:

  One very coarse-grained boosting trick you could use is to
  repeat the text of headers, etc., that you want to boost.

This trick adjusts the term frequency of important terms.

I don't know of any other approaches besides this trick, except using field 
boosting, which would require you to have separate fields.

 So, to overcome the challenges above, I was thinking about the query
 inflation approach, having a negative boost for the inflated
 terms as you suggested.

Actually, I was referring to a reduced, but non-negative, boost - like 0.5 
instead of 1.0.  AFAIK, Lucene does not support negative boosts.

 I will appreciate any different takes on this one, as this is
 going to be the first public Lucene Hebrew analyzer...

One thought - for ambiguous terms, your stemming component could emit all of 
the alternatives at the same position.

 Using this approach I only need to make sure I do not inflate those too
 much (1024 is the standard limit, right?).

1024 is the default maximum number of BooleanClause children, but you can set 
this higher (or lower) should you desire:

http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)

 Also, how can I check whether a word I inflated exists in the
 index BEFORE executing the query? Is that recommended at all?

See IndexReader.terms():

http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/index/IndexReader.html#terms()

If, as an offline process, you were to trim your query expansion map so that it 
included only terms known to be in the index, the resulting simpler queries 
should impact positively on performance.
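
For the exists-in-the-index check, a cheap alternative sketch (not from the
thread; the field name, term, and index path are assumptions) is to ask for
the term's document frequency:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    IndexReader reader = IndexReader.open("/path/to/index");
    // docFreq() == 0 means the inflected form never occurs, so it can be
    // dropped from the expanded query before searching.
    boolean exists = reader.docFreq(new Term("content", "someInflectedForm")) > 0;
    reader.close();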

  It's worth noting, as the above-linked documentation for
  Field.setBoost() does, that field boosts are not stored
  independently of other normalization factors in the index.
 
 Does this mean I should stick with boosting fields in the
 query phase only?

No - I mentioned this only to alert you to the fact that field boosts are 
stored in the index only as part of the field norm, which is an amalgam 
including a couple of other factors.

Index-time field boosting could potentially do good things for you - it's worth 
trying out.

Steve




Re: strange exception while indexing

2008-01-24 Thread Michael McCandless


That means that one of the merges, which run in the background by  
default with 2.3, hit an unhandled exception.


Did you see another exception logged / printed to stderr before this  
one?


Mike

Cam Bazz wrote:


Does anyone have any idea about the error I got while indexing?

Best Regards,
-C.B.

Exception in thread "main" java.io.IOException: background merge hit
exception: _kq:C962870 _kr:C2591 into _ks [optimize]
at org.apache.lucene.index.IndexWriter.optimize 
(IndexWriter.java:1749)
at org.apache.lucene.index.IndexWriter.optimize 
(IndexWriter.java:1689)
at org.apache.lucene.index.IndexWriter.optimize 
(IndexWriter.java:1669)






Re: strange exception while indexing

2008-01-24 Thread Cam Bazz
No, only after that there was a GC error.
I am also not using the compound index file format, in order to increase
indexing speed. Could it be because of that?
I will run the test case again tomorrow. What can I do to increase logging?

Best,
-C.B.

On Jan 24, 2008 11:52 PM, Michael McCandless [EMAIL PROTECTED]
wrote:


 That means that one of the merges, which run in the background by
 default with 2.3, hit an unhandled exception.

 Did you see another exception logged / printed to stderr before this
 one?

 Mike

 Cam Bazz wrote:

  Does anyone have any idea about the error I got while indexing?
 
  Best Regards,
  -C.B.
 
  Exception in thread main java.io.IOException: background merge hit
  exception: _kq:C962870 _kr:C2591 into _ks [optimize]
  at org.apache.lucene.index.IndexWriter.optimize
  (IndexWriter.java:1749)
  at org.apache.lucene.index.IndexWriter.optimize
  (IndexWriter.java:1689)
  at org.apache.lucene.index.IndexWriter.optimize
  (IndexWriter.java:1669)






Re: strange exception while indexing

2008-01-24 Thread Michael McCandless


Hmm, you should have seen an exception before that one from optimize.

Can you post the GC error?  Was it an OutOfMemoryError situation?

Mike

On Jan 24, 2008, at 5:32 PM, Cam Bazz wrote:


no. only after that there was a gc error.
I am also not using the compound index file format in order to  
increase

indexing speed. could it be because of that?
I will run the test case again tomorrow. What can I do to increase  
logging?


Best,
-C.B.

On Jan 24, 2008 11:52 PM, Michael McCandless  
[EMAIL PROTECTED]

wrote:



That means that one of the merges, which run in the background by
default with 2.3, hit an unhandled exception.

Did you see another exception logged / printed to stderr before this
one?

Mike

Cam Bazz wrote:


Does anyone have any idea about the error I got while indexing?

Best Regards,
-C.B.

Exception in thread main java.io.IOException: background merge hit
exception: _kq:C962870 _kr:C2591 into _ks [optimize]
at org.apache.lucene.index.IndexWriter.optimize
(IndexWriter.java:1749)
at org.apache.lucene.index.IndexWriter.optimize
(IndexWriter.java:1689)
at org.apache.lucene.index.IndexWriter.optimize
(IndexWriter.java:1669)











Re: strange exception while indexing

2008-01-24 Thread Michael McCandless


Oh, also, I don't think not using CFS would lead to this, unless it's  
somehow triggering too many file descriptors...


Mike

Cam Bazz wrote:


no. only after that there was a gc error.
I am also not using the compound index file format in order to  
increase

indexing speed. could it be because of that?
I will run the test case again tomorrow. What can I do to increase  
logging?


Best,
-C.B.

On Jan 24, 2008 11:52 PM, Michael McCandless  
[EMAIL PROTECTED]

wrote:



That means that one of the merges, which run in the background by
default with 2.3, hit an unhandled exception.

Did you see another exception logged / printed to stderr before this
one?

Mike

Cam Bazz wrote:


Does anyone have any idea about the error I got while indexing?

Best Regards,
-C.B.

Exception in thread main java.io.IOException: background merge hit
exception: _kq:C962870 _kr:C2591 into _ks [optimize]
at org.apache.lucene.index.IndexWriter.optimize
(IndexWriter.java:1749)
at org.apache.lucene.index.IndexWriter.optimize
(IndexWriter.java:1689)
at org.apache.lucene.index.IndexWriter.optimize
(IndexWriter.java:1669)











RE: Design questions

2008-01-24 Thread spring
 Or, you could just do things twice. That is, send your text through
 a TokenStream, then call next() and count. Then send it all
 through doc.add().

Hm.

This means reading the content twice, no matter whether I use my own analyzer
or override/wrap the main analyzer.

Is there a hook anywhere where I can grab the last token when I call
Document#add?

Thank you.





Lucene to index OCR text

2008-01-24 Thread Renaud Waldura
I've been poking around the list archives and didn't really come up against
anything interesting. Anyone using Lucene to index OCR text? Any
strategies/algorithms/packages you recommend?
 
I have a large collection (10^7 docs) that's mostly the result of OCR. We
index/search/etc. with Lucene without any trouble, but OCR errors are a
problem, when doing exact phrase matches in particular. I'm looking for
ideas on how to deal with this thorny problem.
 
--
Renaud Waldura
Applications Group Manager
Library and Center for Knowledge Management
University of California, San Francisco
(415) 502-6660

 


MapReduce usage with Lucene Indexing

2008-01-24 Thread roger dimitri
Hi,
   I am very new to Lucene & Hadoop, and I have a project where I need to
use Lucene to index some input given either as a huge collection of
Java objects or one huge Java object.
  I read about Hadoop's MapReduce utilities and I want to leverage that feature
in my case described above.

Can someone please tell me how I can approach the problem described
above, because all of Hadoop's MapReduce examples out there show only
file-based input and don't explicitly deal with data coming in as a
huge Java object, so to speak.

Any help is greatly appreciated.

Thanks,
Roger



  

Never miss a thing.  Make Yahoo your home page. 
http://www.yahoo.com/r/hs

Re: Lucene to index OCR text

2008-01-24 Thread Erick Erickson
Lots of luck to you, because I haven't a clue. My company deals with
OCR data and we haven't had a single workable idea. Of course, our
data sets are minuscule compared to what you're talking about, so we
haven't tried to heuristically clean up the data.

But given that Google is scanning the entire U of Mich library, there has
to be an answer out there, but I wonder if it's applicable to already OCRd
data or whether it's the scanning itself.

There are, as you well know, two issues. First, are the words
recognizable. As in actual English words. Which is easily checkable via
a dictionary. Which doesn't help much since I've seen OCR that consists
of English words that are total nonsense. Assuming you're scanning
English texts. Assuming it's modern English..

Second, particularly in our case, we have a very significant number of names
to deal with. So a dictionary check is pretty useless.

We've squirmed out of the problem by having the tables of contents keyed
in by hand and then providing our users with links to the OCR image of the
scanned data. Since this is genealogy research, it at least gives them a way
to verify what our searches return. But inevitably there are false hits as
well as false misses.

I've considered creating a dictionary of non-English words on the assumption
that there will be a finite number of misspellings. But this is OCR data, so
the set of misspelled words could very well be bigger than the total number
of words in the English language, depending on the condition of your source
and how well the OCR data is done. But, again, our situation is that the
projects aren't large enough to make significant investments in even
exploring this.

I suppose that one could think about asking a dictionary program for
suggestions, but I haven't a clue how useful that would be. Especially for
names or technical data.

The LDS church (The Church of Jesus Christ of Latter-day Saints) is doing
something interesting that has the flavor of [EMAIL PROTECTED] They're getting
volunteers to key in pages. Two different volunteers key in each page. Then a
comparison is done and the differences are arbitrated.

As you can tell, I have nothing really useful to suggest on the scale you're
talking about. 10^7 is a LOT of documents.

But I'd also be very interested in anything you come across, especially in
the way of cleaning existing OCRd data. Mostly, I'm expressing sympathy for
the size and complexity of the task you're undertaking G..

Best
Erick


On Jan 24, 2008 8:43 PM, Renaud Waldura [EMAIL PROTECTED]
wrote:

 I've been poking around the list archives and didn't really come up
 against
 anything interesting. Anyone using Lucene to index OCR text? Any
 strategies/algorithms/packages you recommend?

 I have a large collection (10^7 docs) that's mostly the result of OCR. We
 index/search/etc. with Lucene without any trouble, but OCR errors are a
 problem, when doing exact phrase matches in particular. I'm looking for
 ideas on how to deal with this thorny problem.

 --
 Renaud Waldura
 Applications Group Manager
 Library and Center for Knowledge Management
 University of California, San Francisco
 (415) 502-6660





Re: Lucene to index OCR text

2008-01-24 Thread Kyle Maxwell
 I've been poking around the list archives and didn't really come up against
 anything interesting. Anyone using Lucene to index OCR text? Any
 strategies/algorithms/packages you recommend?

 I have a large collection (10^7 docs) that's mostly the result of OCR. We
 index/search/etc. with Lucene without any trouble, but OCR errors are a
 problem, when doing exact phrase matches in particular. I'm looking for
 ideas on how to deal with this thorny problem.

How about letter-by-letter n-grams coupled with SpanQueries (or, more
likely, a custom query utilizing the TermPositions iterator)?
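
A rough sketch of what the n-gram half of that could look like at query time
(Lucene 2.x span API; the field name, the trigrams, and the slop of 2 are
illustrative assumptions): index character trigrams, then stitch a phrase back
together with a SpanNearQuery that tolerates a little OCR noise.

    SpanQuery[] grams = new SpanQuery[] {
        new SpanTermQuery(new Term("content", "epi")),
        new SpanTermQuery(new Term("content", "pid")),
        new SpanTermQuery(new Term("content", "ide")),
    };
    // slop = 2, in order: a gram may be damaged or shifted without losing the match
    SpanNearQuery near = new SpanNearQuery(grams, 2, true);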

-- 
Kyle Maxwell
Software Engineer
CastTV, Inc
http://www.casttv.com




[ANNOUNCE] Lucene Java 2.3.0 release available

2008-01-24 Thread Michael Busch
Release 2.3.0 of Lucene Java is now available!

Many new features, optimizations, and bug fixes have been added since
2.2, including:

  * significantly improved indexing performance
  * segment merging in background threads
  * refreshable IndexReaders
  * faster StandardAnalyzer and improved Token API
  * TermVectorMapper to customize how term vectors are loaded
  * live backups (without pausing indexing) with SnapshotDeletionPolicy
  * CheckIndex tool to test & recover a corrupt index
  * pluggable MergePolicy & MergeScheduler
  * partial optimize(int maxNumSegments) method
  * New contrib module for working with Wikipedia content

The detailed change log is at:
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_3_0/CHANGES.txt

Lucene 2.3 includes index format changes that are not readable by older
versions of Lucene.  Lucene 2.3 can both read and update older Lucene
indexes.  Adding to an index with an older format will cause it to be
converted to the newer format.

Binary and source distributions are available at
http://www.apache.org/dyn/closer.cgi/lucene/java/

Lucene artifacts are also available in the Maven2 repository at
http://repo1.maven.org/maven2/org/apache/lucene/

-Michael (on behalf of the Lucene team)




Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-24 Thread Michael Stoppelman
Hi all,

I've been tracking down a problem happening in our production environment.
When we switch an index after doing deletes & adds, running some searches,
and finally changing the pointer from the old index to the new one, all the
threads start stacking up, waiting on isDeleted(). The threads do seem to
finish; they just get really slow, taking up to 30-60 seconds.

The problem has been discussed here before in 2005:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200510.mbox/[EMAIL 
PROTECTED]


Does anyone have any suggestions on how to work around this?

-M


Re: Archiving Index using partitions

2008-01-24 Thread vivek sar
Thanks Otis for your response. I have a few more questions:

1) Is it recommended to do index partitioning for large indexes?
   - We index around 35 fields (storing only two of them - simple ids)
   - Each document is around 200 bytes
   - Our index grows to around 50G a week

2) The reasons I can think of for partitioning would be:
  - optimization would be faster on smaller indexes
  - search would be faster if I have to search only on specific partition
  - I would be able to archive old partitions
  - Even if a partition gets corrupt I wouldn't lose all data

Is this correct? Are there any other reasons?

Thanks,
-vivek



On Jan 21, 2008 2:32 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

 Why not just design your system to roll over to a new index on a weekly
 basis (new IndexWriter on a new index dir, roughly speaking)?  You can't
 partition a single Document, if that is what you are asking.  But you can
 create multiple smaller indices (e.g. weekly) instead of one large one, and
 then every 2 weeks archive the ones older than 2 weeks.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: vivek sar [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Monday, January 21, 2008 3:06:50 PM
 Subject: Archiving Index using partitions

 Hi,

  As a requirement I need to be able to archive any indexes older than
 2 weeks (due to space and performance reasons). That means I would
 need to maintain weekly indexes. Here are my questions,

 1) What's the best way to partition indexes using Lucene?
 2) Is there a way I can partition documents, but not indexes? I don't
 want each partitioned index to be a full index, as that would be waste
 of space. We collect over 10K new documents per min (with each
 document around 250 bytes).
 3) Is ParallelMultiSearcher the way to go for partitioned indexes? Do
 I ever have to merge these partitioned indexes?
 4) I'm hoping I can reload the archived indexes in future if needed.

 Not sure if there is a standard way to archive the indexes using
  Lucene.

 Thanks,
 -vivek






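
For reference, a bare-bones sketch of the weekly-rollover scheme Otis describes
above (the directory layout, the two-week search window, and the variable
names are all illustrative assumptions):

    // One index directory per week; a fresh writer when the week rolls over.
    IndexWriter writer = new IndexWriter("/indexes/2008-week04",
            new StandardAnalyzer(), true);  // true = create the new index
    // ... add this week's documents, then close the writer ...

    // Search only the partitions that are still "live" (here, the last two weeks).
    Searchable[] shards = new Searchable[] {
        new IndexSearcher("/indexes/2008-week03"),
        new IndexSearcher("/indexes/2008-week04"),
    };
    ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);

    // Archiving is then just moving (or deleting) directories older than two weeks.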


