Re: Call Lucene default command line Search from PHP script

2008-03-25 Thread Mathieu Lecarme

milu07 wrote:

Hello,

My machine is Ubuntu 7.10. I am working with Apache Lucene. I am done with
the indexer and tried the command line Searcher (the default command line demo
included in the Lucene package: http://lucene.apache.org/java/2_3_1/demo2.html).
When I use this at the command line:

java Searcher -query algorithm

it works and returns a list of results. Here 'algorithm' is the keyword
to search for.

However, I want a web search interface written in PHP, so I use PHP's
exec() to call this Searcher from my PHP script:

exec("java Searcher -query algorithm ", $arr, $retVal);
[I also tried: exec("java Searcher -query 'algorithm' ", $arr, $retVal)]

It does not work. When I print the value of $retVal, it is 1.

I came back and tried: exec("java Searcher -query algorithm 2>&1 ", $arr,
$retVal);
and received:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/lucene/analysis/Analyzer
and $retVal is 1.


The command line Searcher.java of Lucene imports many classes; is this the
problem?
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;


I guess this is a path problem. However, I do not know how to fix it
because it works at the command line ($CLASSPATH points to the .jar file of
the Lucene library). Maybe PHP does not know $CLASSPATH. So, I added the
Lucene lib to $PATH:

export PATH=$PATH:/usr/lib/lucene-core-2.3.1.jar:/usr/lib

However, I get the same error message when I try: exec("java Searcher -query
algorithm 2>&1 ", $arr, $retVal);
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/lucene/analysis/Analyzer

Could you please help?

Thank you,
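
The NoClassDefFoundError above points at the classpath: a JVM launched from
PHP (for example under Apache) does not necessarily inherit the interactive
shell's $CLASSPATH, and appending a jar to $PATH has no effect on Java. One
likely fix, reusing the jar path quoted above, is to pass the classpath
explicitly in the call:

    exec("java -cp /usr/lib/lucene-core-2.3.1.jar:. Searcher -query algorithm 2>&1", $arr, $retVal);

The trailing "." keeps the directory containing Searcher.class on the
classpath.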
  

Using the command line from PHP is a bad idea.
A socket is a better way:
https://admin.garambrogne.net/projets/passerelle/browser/trunk/goniometre

M.




RE: Field values ...

2008-03-25 Thread Dragon Fly
Thanks.

> Date: Mon, 24 Mar 2008 21:03:13 -0700
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: RE: Field values ...
> 
> 
> : The Id and Phone fields are stored.  So I can just do a MatchAllQuery as 
> : you suggested.  I have read about field selectors on this mailing list 
> : but have never used one.  Does anyone know where I can find some sample 
> : code? Thank you.
> 
> there's a couple of reusable implementations in subversion...
> 
> http://www.krugle.org/kse/files?query=%22implements%20FieldSelector%22%20lucene&lang=java&findin=code
> 
> 
> 
> -Hoss
> 
> 


Improving Index Search Performance

2008-03-25 Thread Shailendra Mudgal
Hi Everyone,

We are using Lucene to search an index of around 20G in size, with around 3
million documents. We are facing performance issues loading large results
from the index. Based on the various posts on the forum and the documentation,
we have made the following code changes to improve the performance:

i. Modified the code to use HitCollector instead of Hits, since we will be
loading all the documents in the index that match the keywords
ii. Added a MapFieldSelector to load only selected fields (2 fields only)
instead of all 14

After all these changes, it seems to take around 90 secs to load 17k
documents. After profiling, we found that most of the time is spent in
searcher.doc(id, selector).

Here is the code:

public void collect(int id, float score) {
    try {
        MapFieldSelector selector = new MapFieldSelector(new String[] {COMPANY_ID, ID});
        doc = searcher.doc(id, selector);
        mappedCompanies = doc.getValues(COMPANY_ID);
    } catch (IOException e) {
        logger.debug("inside IDCollector.collect() :" + e.getMessage());
    }
}

We also read in one of the posts that we should use bitSet.set(doc)
instead of calling searcher.doc(id). But we are unable to understand how
this might help in our case, since we will anyway have to load the document
to get the other required field (company_id). Also, we observed that the
searcher is actually using only 1G of RAM though we have 4G allocated to it.

Can someone suggest any other optimization that can be done to
improve the search performance on MultiSearcher? Any help would be
appreciated.

Thanks,
Vipin


Integrating Spell Checker contributed to Lucene

2008-03-25 Thread Ivan Vasilev

Hi Guys,

Has anybody integrated the spell checker contributed to Lucene? I need
advice on where to get a free dictionary file (one that contains all
words in English) that could be used to create an instance of the
PlainTextDictionary class. For my tests I currently use the corresponding
files from the Jazzy and JADT projects, but I think I do not have the
right to use them officially outside of their applications.


Best Regards,
Ivan





Re: Integrating Spell Checker contributed to Lucene

2008-03-25 Thread Mathieu Lecarme

Ivan Vasilev wrote:

Hi Guys,

Has anybody integrated the spell checker contributed to Lucene?

http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index
https://issues.apache.org/jira/browse/LUCENE-1190

I need advice on where to get a free dictionary file (one that
contains all words in English) that could be used to create an instance
of the PlainTextDictionary class.

A list of all English words is nonsense. Have a look at WordNet and Hunspell.
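
A minimal sketch of wiring a plain word list into the contrib spell checker;
the file name words.txt (one word per line, e.g. extracted from a Hunspell or
WordNet distribution), the index path, and the misspelled input are
assumptions, not part of the original posts:

    import java.io.File;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class SpellDemo {
        public static void main(String[] args) throws Exception {
            // build the spelling index from a one-word-per-line file
            SpellChecker spell =
                new SpellChecker(FSDirectory.getDirectory("/tmp/spellindex"));
            spell.indexDictionary(new PlainTextDictionary(new File("words.txt")));
            // ask for up to 5 corrections for a misspelled word
            String[] suggestions = spell.suggestSimilar("disambigaute", 5);
            for (String s : suggestions) {
                System.out.println(s);
            }
        }
    }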

For my tests I currently use the corresponding files from the Jazzy and
JADT projects, but I think I do not have the right to use them officially
outside of their applications.


Best Regards,
Ivan





hitcollector topdocs

2008-03-25 Thread JensBurkhardt

Hi everybody,

I was searching for information about the HitCollector. I was wondering
whether the values of the fields have to be stored or not. I tested it and
it worked both ways, but I'm still not really sure about it.
My second question is: can I work with tokenized fields?

Best regards

Jens





Re: Improving Index Search Performance

2008-03-25 Thread Toke Eskildsen
On Tue, 2008-03-25 at 18:13 +0530, Shailendra Mudgal wrote:
> We are using Lucene to search on a index of around 20G size with around 3
> million documents. We are facing performance issues loading large results
> from the index. [...]
> After all these changes, it seems to take around 90 secs to load 17k
> documents. [...]

That's fairly slow. Are you doing any warm-up? It is my experience that
it helps tremendously with performance.

I tried requesting a stored field from all hits for all searches with
logged queries on our index (9 million documents, 37GB), no fancy
tricks, just Hits and hit.get(fieldname). For the first couple of
minutes, using standard hard disks, performance was about 200-300
field-requests/second. After that, the speed increased to about 2,000-3,000
field-requests/second.

Using solid state drives, the same pattern could be seen, just with much
lower warm-up time before the full speed kicked in.
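
A rough warm-up sketch along those lines; the query list, the field name
"title", and the cap of 100 documents per query are assumptions, not taken
from the measured setup:

    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // replay representative logged queries and touch a stored field so
    // the OS file cache is hot before real traffic arrives
    static void warmUp(IndexSearcher searcher, List<Query> warmupQueries)
            throws Exception {
        for (Query q : warmupQueries) {
            Hits hits = searcher.search(q);
            int n = Math.min(hits.length(), 100);
            for (int i = 0; i < n; i++) {
                Document d = hits.doc(i);
                d.get("title");
            }
        }
    }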

> Here is the code:
>
> public void collect(int id, float score) {
>     try {
>         MapFieldSelector selector = new MapFieldSelector(new String[] {COMPANY_ID, ID});
>         doc = searcher.doc(id, selector);
>         mappedCompanies = doc.getValues(COMPANY_ID);
>     } catch (IOException e) {
>         logger.debug("inside IDCollector.collect() :" + e.getMessage());
>     }
> }

There's no need to initialize the selector for every collect-call.
Try moving the initialization outside of the collect method.

> [...] Also we observed that the searcher is actually using only 1G RAM though
>  we have 4G allocated to it.

The system will (hopefully) utilize the free RAM for disk-cache, so the
last 3GB are not wasted.





Re: explain() - fieldnorm

2008-03-25 Thread JensBurkhardt

Another problem just occurred. These are the results from explain():

0.27576536 = (MATCH) product of:
  0.827296 = (MATCH) sum of:
0.827296 = (MATCH) sum of:
  0.24544832 = (MATCH) weight(ti:genetik in 1849319), product of:
0.015469407 = queryWeight(ti:genetik), product of:
  10.577795 = idf(docFreq=270)
  0.0014624415 = queryNorm
15.866693 = (MATCH) fieldWeight(ti:genetik in 1849319), product of:
  1.0 = tf(termFreq(ti:genetik)=1)
  10.577795 = idf(docFreq=270)
  1.5 = fieldNorm(field=ti, doc=1849319)
  0.58184767 = (MATCH) weight(au:knippers in 1849319), product of:
0.020028148 = queryWeight(au:knippers), product of:
  13.695007 = idf(docFreq=11)
  0.0014624415 = queryNorm
29.051497 = (MATCH) fieldWeight(au:knippers in 1849319), product of:
  1.4142135 = tf(termFreq(au:knippers)=2)
  13.695007 = idf(docFreq=11)
  1.5 = fieldNorm(field=au, doc=1849319)
  0.33333334 = coord(1/3)

0.27576536 = (MATCH) product of:
  0.827296 = (MATCH) sum of:
0.827296 = (MATCH) sum of:
  0.24544832 = (MATCH) weight(ti:genetik in 3221603), product of:
0.015469407 = queryWeight(ti:genetik), product of:
  10.577795 = idf(docFreq=270)
  0.0014624415 = queryNorm
15.866693 = (MATCH) fieldWeight(ti:genetik in 3221603), product of:
  1.0 = tf(termFreq(ti:genetik)=1)
  10.577795 = idf(docFreq=270)
  1.5 = fieldNorm(field=ti, doc=3221603)
  0.58184767 = (MATCH) weight(au:knippers in 3221603), product of:
0.020028148 = queryWeight(au:knippers), product of:
  13.695007 = idf(docFreq=11)
  0.0014624415 = queryNorm
29.051497 = (MATCH) fieldWeight(au:knippers in 3221603), product of:
  1.4142135 = tf(termFreq(au:knippers)=2)
  13.695007 = idf(docFreq=11)
  1.5 = fieldNorm(field=au, doc=3221603)
  0.33333334 = coord(1/3)

As you can see, both are exactly the same. The thing I don't understand is
that the two documents have different document boosts (the first one got a
boost of 1.62, the second 1.65) - the boosts are different because the
two books have different publication years - but explain() tells me that my
fieldNorm value is 1.5.
While indexing I use a custom Similarity class whose lengthNorm just returns 1,
so the field length does not matter anymore.

Best Regards 

Jens Burkhardt


hossman wrote:
> 
> : As my subject is telling, i have a little problem with analyzing the
> : explain() output.
> : I know, that the fieldnorm value consists out of "documentboost,
> fieldboost
> : and lengthNorm". 
> : Is is possible to recieve the single values? I know that they are
> multiplied
> : while indexing but
> : can they be stored so that i can read them when i analyze my search?
> 
> the number of terms the docs have in a given field can be determined by 
> doing a nested iteration over a TermEnum and TermDocs and keeping count, 
> but there is no way to extract the document boost vs. the field boost 
> -- if you want to know what those were later, you have to store them 
> yourself (in a stored field perhaps).
> 
> : The Problem is, that i have 2 Documents I want to compare but the only
> : difference is the fieldnorm value
> : and i don't know which value exactly makes this difference.
> 
> typically the answer to that question for me is "length" because i don't 
> use field boosts and doc boosts -- if you *do* use field boosts or doc 
> boosts, you would typically know what you had, and could check what boost 
> values you had used later (based on whatever source you originally built 
> your index from)
> 
> 
> 
> 
> -Hoss
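
A sketch of the nested iteration Hoss describes, against the 2.3 API; the
field name "ti" (from the explain output above) and an already-open
IndexReader are assumed:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    // count how many term occurrences each doc has in field "ti"
    static int[] termCounts(IndexReader reader) throws Exception {
        int[] counts = new int[reader.maxDoc()];
        TermEnum terms = reader.terms(new Term("ti", ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !"ti".equals(t.field())) break; // left the field
                TermDocs docs = reader.termDocs(t);
                while (docs.next()) {
                    counts[docs.doc()] += docs.freq();
                }
                docs.close();
            } while (terms.next());
        } finally {
            terms.close();
        }
        return counts;
    }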






Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Jake Mannix
Uwe,
  This is a little off thread-topic, but I was wondering how your
search relevance and search performance have fared with this
bigram-based index.  Is it significantly better than before you used
the NGramAnalyzer?
   -jake



On 3/24/08, Uwe Goetzke <[EMAIL PROTECTED]> wrote:
> Hi Ivan,
> No, we do not use StandardAnalyzer or StandardTokenizer.
>
> Most data is processed by
>   fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
>   result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
>   result = new org.apache.lucene.analysis.LowerCaseFilter(result);
>   result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer
>
> We use our own query parser. The bigrams are searched with a tolerant phrase
> query, scoring in a doc the greatest bigram clusters covering the phrase
> tokens.
>
> Best Regards
>
> Uwe
>
> -Original Message-
> From: Ivan Vasilev [mailto:[EMAIL PROTECTED]
> Sent: Friday, 21 March 2008 16:25
> To: java-user@lucene.apache.org
> Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Uwe,
>
> Could you tell what Analyzer you use, given that you saw such a big indexing
> speedup?
> If you use StandardAnalyzer (which uses StandardTokenizer), maybe the
> reason is in it. You can see the second-to-last report in the thread "Indexing
> Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter Jake
> Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl,
> which is now generated by JFlex instead of JavaCC.
> I am asking because I noticed a great speedup in adding documents to
> the index in our system. We have time control on this in debug mode. NOW
> THEY ARE ADDED 5 TIMES FASTER!!!
> But at the same time, the total process of indexing in our case shows an
> improvement of about 8%. As our system is very big and complex, I am
> wondering whether the whole process of indexing really is reduced so
> remarkably and our system causes this slowdown, or whether Lucene does
> some optimizations on the index, merges or something else, and this is
> the reason the total process of indexing is not so reasonably faster.
>
> Best Regards,
> Ivan
>
>
>
> Uwe Goetzke wrote:
> > This week I switched the Lucene library version on one customer system.
> > The indexing time went down from 46m32s to 16m20s for the complete task,
> > including optimisation. Great job!
> > We index product catalogs from several suppliers; in this case, around
> > 56,000 product groups and 360,000 products including descriptions were
> > indexed.
> >
> > Regards
> >
> > Uwe
> >
> >
> >


AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Uwe Goetzke
Jake,

With the bigram-based index we gave up the struggle to find a well-working
language-based index.
We had implemented soundex (and various "sound-alike" variants) and hyphenation, but
failed to deliver a search result explainable to the user ("why is this ranked higher"
and so on...). One reason may be that product descriptions contain a lot of
abbreviations.

The index size grew about 30%.
The search performance seems a bit slower, but I have no concrete figures. The
evaluation for one document is a bit more complex than for a phrase query.
One reason of course is that more terms are evaluated. But nevertheless it
is quite good.

The search relevance improved tremendously. Missing characters, switched
letters and partial word fragments are no real problems any more (of course
depending on the length of the search word).
The search term "weekday" also finds "day of the week", and "disabigaute" finds
"disambiguate".
The algorithms I developed might not fit other domains, but for multi-language
catalogs of products it works quite well for us. So far...


Regards Uwe

-Original Message-
From: Jake Mannix [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 25 March 2008 17:13
To: java-user@lucene.apache.org
Subject: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1

Uwe,
  This is a little off thread-topic, but I was wondering how your
search relevance and search performance have fared with this
bigram-based index.  Is it significantly better than before you used
the NGramAnalyzer?
   -jake



On 3/24/08, Uwe Goetzke <[EMAIL PROTECTED]> wrote:
> [...]

random accessing term value

2008-03-25 Thread John Wang
Hi:

   Is there a way to randomly access a term value in a field? e.g.

   in my field, content, the terms are: lucene, is, cool

   Is there a way to access content[2] -> cool?

Thanks

-John


Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Jay

Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter
stemming and word n-gram identification?


Thanks!

Jay

Uwe Goetzke wrote:

Hi Ivan,
No, we do not use StandardAnalyzer or StandardTokenizer.

Most data is processed by
  fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
  result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
  result = new org.apache.lucene.analysis.LowerCaseFilter(result);
  result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer

We use our own query parser. The bigrams are searched with a tolerant phrase
query, scoring in a doc the greatest bigram clusters covering the phrase
tokens.


Best Regards

Uwe

[...]



Re: random accessing term value

2008-03-25 Thread Erik Hatcher


On Mar 25, 2008, at 1:32 PM, John Wang wrote:

   Is there a way to randomly access a term value in a field? e.g.

   in my field, content, the terms are: lucene, is, cool

   Is there a way to access content[2] -> cool?


Term vectors, or reanalysis of the field, are two ways that come to
mind.  Maybe other ways?


Erik
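
A sketch of the term vector route; it assumes the "content" field was indexed
with Field.TermVector.WITH_POSITIONS so that positions can be mapped back to
tokens:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.index.TermPositionVector;

    // return the token at position n of field "content", or null
    static String tokenAt(IndexReader reader, int docId, int n) throws Exception {
        TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
        if (tfv instanceof TermPositionVector) {
            TermPositionVector tpv = (TermPositionVector) tfv;
            String[] terms = tpv.getTerms(); // stored in lexicographic order
            for (int i = 0; i < terms.length; i++) {
                for (int pos : tpv.getTermPositions(i)) {
                    if (pos == n) return terms[i];
                }
            }
        }
        return null;
    }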





Re: Improving Index Search Performance

2008-03-25 Thread Paul Elschot
Shailendra,

Have a look at the javadocs of HitCollector:
http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/search/HitCollector.html

The problem is disk head movement: when retrieving
the documents during collecting, the disk head has to move
between the inverted index and the stored documents; see also
the file formats.

To avoid such excessive disk head movement, you need to collect
all (or at least many more than one of) your document ids during
collect(), for example into an int[].
After collecting, retrieve all the docs with Searcher.doc().

Also, for the same reason, retrieving docs is best done in doc id
order, but that is unlikely to go wrong as doc ids are normally
collected in increasing order.

Regards,
Paul Elschot
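
A minimal sketch of this collect-then-fetch pattern against the 2.3 API; the
searcher, the query, and the COMPANY_ID/ID constants are taken from the
original post, the rest is an assumption:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.MapFieldSelector;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    static void collectThenFetch(Searcher searcher, Query query) throws Exception {
        // phase 1: collect ids only -- no disk seeks inside collect()
        final List<Integer> ids = new ArrayList<Integer>();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                ids.add(doc);
            }
        });
        // phase 2: fetch stored fields afterwards; ids arrive in increasing
        // doc id order, which keeps the disk head moving forward
        MapFieldSelector selector =
            new MapFieldSelector(new String[] {COMPANY_ID, ID});
        for (int id : ids) {
            Document doc = searcher.doc(id, selector);
            String[] mappedCompanies = doc.getValues(COMPANY_ID);
            // ... use mappedCompanies ...
        }
    }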


On Tuesday 25 March 2008 13:43:18, Shailendra Mudgal wrote:
> [...]






Re: Improving Index Search Performance

2008-03-25 Thread Chris Hostetter

: We also read in one of the posts that we should use bitSet.set(doc)
: instead of calling searcher.doc(id). But we are unable to understand how
: this might help in our case since we will anyway have to load the document
: to get the other required field (company_id). Also we observed that the
: searcher is actually using only 1G RAM though we have 4G allocated to it.

in addition to Paul's previous excellent suggestion, note that if:
  * companyId is a single value field (ie: no document has more than one)
  * companyId is indexed

you can use the FieldCache to look up the companyId for each doc.  on the 
aggregate this will most likely be much faster than accessing the stored 
fields.


-Hoss
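
A sketch of the FieldCache route under those two conditions; the field name
"companyId" mirrors the thread, while the IndexSearcher (rather than the
MultiSearcher) is an assumption:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    static void collectCompanyIds(IndexSearcher searcher, Query query)
            throws Exception {
        IndexReader reader = searcher.getIndexReader();
        // one array lookup per hit instead of a stored-field read; the
        // array is built once per reader and then cached
        final String[] companyIds =
            FieldCache.DEFAULT.getStrings(reader, "companyId");
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                String companyId = companyIds[doc];
                // ... aggregate companyId ...
            }
        });
    }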





Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Otis Gospodnetic
Jay,

Have a look at Lucene config, it's all there, including tests.  This filter 
will take a token such as "foobar" and chop it up into n-grams (e.g. foobar -> 
fo oo ob ba ar would be a set of bi-grams).  You can specify the n-gram size 
and even min and max n-gram size.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Jay <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 1:32:24 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter
stemming and word n-gram identification?

Thanks!

Jay

Uwe Goetzke wrote:
> [...]

Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Jay
Sorry, I could not find the filter in the 2.3 API class list (core +
contrib + test). I am not aware of a Lucene config file either. Could you
please tell me where it is in the 2.3 release?


Thanks!

Jay

Otis Gospodnetic wrote:

Jay,

Have a look at Lucene config, it's all there, including tests.  This filter
will take a token such as "foobar" and chop it up into n-grams (e.g. foobar ->
fo oo ob ba ar would be a set of bi-grams).  You can specify the n-gram size
and even min and max n-gram size.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

[...]

Re: hitcollector topdocs

2008-03-25 Thread Grant Ingersoll

Hi Jens,

I'm having a bit of a hard time following this, so perhaps you could  
rephrase, or show your sample code or explain a bit more about what  
you are trying to do at a higher level?


Cheers,
Grant

On Mar 25, 2008, at 10:46 AM, JensBurkhardt wrote:



Hi everybody,

I was searching for information about the HitCollector. I was wondering
whether the values of the fields have to be stored or not. I tested it and
it worked both ways, but I'm still not really sure about it.
My second question is: can I work with tokenized fields?

Best regards

Jens





--
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: explain() - fieldnorm

2008-03-25 Thread Grant Ingersoll


On Mar 25, 2008, at 12:10 PM, JensBurkhardt wrote:


As you can see, both are exactly the same. The thing I don't understand is
that the two documents have different document boosts (the first one got a
boost of 1.62, the second 1.65) - the boosts are different because the
two books have different publication years - but explain() tells me that my
fieldNorm value is 1.5.


Document boosts do not have much granularity due to the limited number
of bits in the norm.  I seem to recall Yonik publishing a list of
values at one time on the mailing list, but I can't for the life of me
conjure the keywords to find it at the moment, as it was on a related
topic.  Perhaps his memory is better than mine...


HTH,
Grant
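
The coarseness is easy to see with the norm codec itself; a small sketch
against the 2.3 Similarity API, using the two boost values from the question:

    import org.apache.lucene.search.Similarity;

    public class NormDemo {
        public static void main(String[] args) {
            // norms are packed into a single byte, so nearby boosts can
            // collapse to the same coarse value after a round trip
            System.out.println(Similarity.decodeNorm(Similarity.encodeNorm(1.62f)));
            System.out.println(Similarity.decodeNorm(Similarity.encodeNorm(1.65f)));
        }
    }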




Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Otis Gospodnetic
Hi Jay,

Sorry, lapsus calami, that would be Lucene *contrib*.
Have a look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Jay <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 6:15:54 PM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Sorry, I could not find the filter in the 2.3 API class list (core +
contrib + test). I am not aware of a Lucene config file either. Could you
please tell me where it is in the 2.3 release?

Thanks!

Jay

Otis Gospodnetic wrote:
> [...]
Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread yu

Hi Otis,
I checked that contrib before and could not find NgramStemFilter. Am I
missing some other contrib?

Thanks for the link!

Jay

Otis Gospodnetic wrote:

Hi Jay,

Sorry, lapsus calami, that would be Lucene *contrib*.
Have a look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

[...]

Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread Otis Gospodnetic
Sorry, I wrote this stuff, but forgot the naming.
Look: 
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
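
A minimal usage sketch of the n-gram filter in that package (NGramTokenFilter;
whether Uwe's NGramStemFilter is the same class is exactly the open question
here), assuming the Lucene 2.3 Token-based TokenStream API:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;

    public class NGramDemo {
        public static void main(String[] args) throws Exception {
            // chop each incoming token into bigrams: fo oo ob ba ar
            TokenStream ts = new NGramTokenFilter(
                    new WhitespaceTokenizer(new StringReader("foobar")), 2, 2);
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText());
            }
        }
    }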

- Original Message 
From: yu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 12:04:33 AM
Subject: Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

Hi Otis,
I checked that contrib before and could not find NgramStemFilter. Am I
missing some other contrib?
Thanks for the link!

Jay

Otis Gospodnetic wrote:
> [...]

Re: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-25 Thread yu

Sorry for my ignorance, but I am looking for NgramStemFilter specifically.
Are you suggesting that it's the same as NGramTokenFilter? Does it have
stemming in it?

Thanks again.

Jay


Otis Gospodnetic wrote:

Sorry, I wrote this stuff, but forgot the naming.
Look: 
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

[...]

Re: random accessing term value

2008-03-25 Thread John Wang
I am not sure how term vectors would help me. Term vectors are ordered by
frequency, not in lex order. Since I know the terms in the dictionary are
ordered lexicographically, it seems possible for me to randomly get the nth
term in the dictionary without having to seek to it.

Thoughts?

Thanks

-John

On Tue, Mar 25, 2008 at 11:16 AM, Erik Hatcher <[EMAIL PROTECTED]>
wrote:

>
> On Mar 25, 2008, at 1:32 PM, John Wang wrote:
> >Is there a way to randomly access a term value in a field? e.g.
> >
> >in my field, content, the terms are: lucene, is, cool
> >
> >Is there a way to access content[2] -> cool?
>
> Term vectors, or reanalysis of the field, are two ways that come to
> mind.  Maybe other ways?
>
>Erik
>
>