FuzzyQuery - prefixLength - use with QueryParser?

2007-12-17 Thread Helmut Jarausch
Hi,

FuzzyQuery (in the 2.2.0 API) may take 3 arguments,
term, minimumSimilarity and prefixLength

Is there any syntax to specify the 3rd argument
in a query term for QueryParser?
(I haven't found any in the current docs)

Many thanks for a hint,

Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



FuzzyQuery - rounding bug?

2007-12-17 Thread Helmut Jarausch
Hi,

according to the LiA book, the FuzzyQuery similarity is computed as

1 - distance / min(textlen, targetlen)

Given
def addDoc(text, writer):
    doc = Document()
    doc.add(Field("field", text,
                  Field.Store.YES, Field.Index.TOKENIZED))
    writer.addDocument(doc)

addDoc("a", writer)
addDoc("b", writer)
addDoc("aaabb", writer)
addDoc("aabbb", writer)
addDoc("a", writer)
addDoc("b", writer)
addDoc("d", writer)

query = FuzzyQuery(Term("field", "a"),0.8,0)

should find "b" since we have
distance = 1
min(textlen,targetlen) = 5

It does find it with
query = FuzzyQuery(Term("field", "a"),0.79,0)
though.

Is there a rounding error bug?

(this is with lucene-java-2.2.0-603782)
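For what it's worth, the arithmetic can be reproduced outside Lucene. The sketch below implements the LiA formula in plain Python; the five-character terms are hypothetical stand-ins for the indexed text, since the archive truncated the original strings. One plausible explanation for the behavior: FuzzyTermEnum appears to accept a term only when its similarity is strictly greater than minimumSimilarity, so a score of exactly 0.8 fails a 0.8 cutoff but passes 0.79 (float rounding in the Java computation could have the same effect).

```python
# Sketch of the FuzzyQuery score described in Lucene in Action:
#   similarity = 1 - editDistance / min(len(text), len(target))
# The terms below are hypothetical stand-ins, not the poster's actual data.

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_similarity(text, target):
    return 1.0 - edit_distance(text, target) / min(len(text), len(target))

sim = fuzzy_similarity("aaaaa", "aaaab")   # distance 1, min length 5
print(sim)                                 # 0.8
# A strict > comparison excludes an exact 0.8 at cutoff 0.8 but keeps it at 0.79:
print(sim > 0.8, sim > 0.79)               # False True
```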

Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany




FuzzyQuery + QueryParser - I'm puzzled

2007-12-17 Thread Helmut Jarausch
Hi,

Please help, I am totally puzzled.

The same query succeeds with a direct call to FuzzyQuery
but fails with QueryParser.

What am I missing?

Sorry, I'm using pylucene (with lucene-java-2.2.0-603782)

#!/usr/bin/python
import lucene
from lucene import *
lucene.initVM(lucene.CLASSPATH)

directory = RAMDirectory()
writer = IndexWriter(directory, WhitespaceAnalyzer(), True)
doc = Document()
doc.add(Field("field","Wolfgang Dahmen  Arnold Reusken",
  Field.Store.YES, Field.Index.TOKENIZED))
writer.addDocument(doc)

writer.optimize()
writer.close()

searcher = IndexSearcher(directory)

FQ = True
# FQ = False   # this case doesn't find anything  <== WHY?

if FQ:
    # this succeeds in finding the entry above
    query = FuzzyQuery(Term("field", "Damen"), 0.79, 0)
else:
    # this fails to find that entry
    parser = QueryParser("field", WhitespaceAnalyzer())
    query = parser.parse("Damen~0.79")

hits = searcher.search(query)
print "there are", hits.length(), "hits"
for k in range(0, hits.length()):
    print hits.doc(k).get("field")

-- 
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany




Re: Query.rewrite - help me to understand it

2007-12-17 Thread qvall

So does it mean that if my query doesn't support prefix or wildcard
queries, then I don't need to use rewrite() for highlighting?
-- 
View this message in context: 
http://www.nabble.com/Query.rewrite---help-me-to-understand-it-tp14314507p14370200.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: FuzzyQuery - rounding bug?

2007-12-17 Thread anjana m
How do I use Lucene search to search files on the local system?

On Dec 17, 2007 2:11 PM, Helmut Jarausch <[EMAIL PROTECTED]>
wrote:



Re: FuzzyQuery + QueryParser - I'm puzzled

2007-12-17 Thread Doron Cohen
See in Lucene FAQ:
  "Are Wildcard, Prefix, and Fuzzy queries case sensitive?"
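The FAQ entry likely explains the failure: QueryParser in 2.2 lowercases fuzzy and wildcard terms by default (see setLowercaseExpandedTerms), while WhitespaceAnalyzer indexes "Dahmen" with its capital letter intact. Since edit distance is case sensitive, the lowercased "damen" is two edits from "Dahmen" rather than one, and the similarity drops below the 0.79 cutoff. A pure-Python illustration of the mismatch (not PyLucene API):

```python
# "Damen" (built directly via FuzzyQuery) is one edit from "Dahmen";
# "damen" (what QueryParser produces after lowercasing) is two edits away,
# because the case difference counts as a substitution.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(query, indexed):
    return 1.0 - edit_distance(query, indexed) / min(len(query), len(indexed))

indexed = "Dahmen"                    # WhitespaceAnalyzer: no lowercasing
print(similarity("Damen", indexed))   # 0.8 -> above 0.79, the API call matches
print(similarity("damen", indexed))   # 0.6 -> parser lowercased it; no match
```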

On Dec 17, 2007 11:27 AM, Helmut Jarausch <[EMAIL PROTECTED]>
wrote:



Re: FuzzyQuery + QueryParser - I'm puzzled

2007-12-17 Thread anjana m
Hey, I am not able to compile; packages are not found.
I downloaded the Lucene package.
Help me.
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

On Dec 17, 2007 4:34 PM, Doron Cohen <[EMAIL PROTECTED]> wrote:



Error with Remote Parallel MultiSearching

2007-12-17 Thread reeja devadas
Hi,

We are working with one web server and 10 search servers; these 10 servers
hold index fragments. All available fragments on these search servers are
bound at their startup time. A remote ParallelMultiSearcher is used for
searching these indices. When a search request comes in, we first look up
(Naming.lookup) whether the requested fragments are available, then create
a list of Searchable objects for the available fragments and carry out the
remaining steps of remote parallel multi-searching with those Searchable
objects. Searching generally works properly in our environment.

But we get the error below once in a while when a search request comes in,
although there is no error at the time the requested index fragments are
looked up. Because of a problem in any one of the Searchable fragments, the
entire search request ends in an exception. The error might be due to a
corrupted index in a particular fragment, a machine problem, or a
remote-object problem.

The first time, the error looks like this:

ERROR [Main Thread] 13 Dec 2007 06:45:41,229 (SearchMaster.java:249) -
caught a class java.rmi.ServerError
 with message: Error occurred in server thread; nested exception is:
java.lang.OutOfMemoryError: nativeGetNewTLA
Exception in thread "Main Thread" java.rmi.ServerError: Error occurred in
server thread; nested exception is:
java.lang.OutOfMemoryError: nativeGetNewTLA
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java
:333)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(
TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(
TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(
TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
ThreadPoolExecutor.java:885)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(
StreamRemoteCall.java:255)
at sun.rmi.transport.StreamRemoteCall.executeCall(
StreamRemoteCall.java:233)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
at org.apache.lucene.search.RemoteSearchable_Stub.rewrite(Unknown
Source)
at org.apache.lucene.search.MultiSearcher.rewrite(MultiSearcher.java
:261)
at org.apache.lucene.search.ParallelMultiSearcher.rewrite(
ParallelMultiSearcher.java:187)
at org.apache.lucene.search.Query.weight(Query.java:94)
at org.apache.lucene.search.Hits.<init>(Hits.java:49)
at org.apache.lucene.search.Searcher.search(Searcher.java:54)
at com.sumobrain.search.SearchMaster.doSearch(SearchMaster.java:244)
at com.sumobrain.search.SearchMaster.main(SearchMaster.java:672)
Caused by: java.lang.OutOfMemoryError: nativeGetNewTLA
at sun.reflect.ByteVectorImpl.resize(ByteVectorImpl.java:66)
at sun.reflect.ByteVectorImpl.add(ByteVectorImpl.java:45)
at sun.reflect.ClassFileAssembler.emitByte(ClassFileAssembler.java
:56)
at sun.reflect.ClassFileAssembler.emitConstantPoolUTF8(
ClassFileAssembler.java:89)
at sun.reflect.AccessorGenerator.emitCommonConstantPoolEntries(
AccessorGenerator.java:123)
at sun.reflect.MethodAccessorGenerator.generate(
MethodAccessorGenerator.java:333)
at
sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(
MethodAccessorGenerator.java:95)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(
ReflectionFactory.java:313)
at java.io.ObjectStreamClass.getSerializableConstructor(
ObjectStreamClass.java:1327)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:437)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:413)
at java.io.ObjectStreamClass.lookup0(ObjectStreamClass.java:310)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java
:547)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java
:1583)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java
:1496)
at java.io.ObjectInputStream.readOrdinaryObject(
ObjectInputStream.java:1732)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java
:1329)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at sun.rmi.server.UnicastRef.unmarshalValue(UnicastRef.java:306)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java
:290)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(
TCPTr

Re: FuzzyQuery - prefixLength - use with QueryParser?

2007-12-17 Thread Erik Hatcher


On Dec 17, 2007, at 3:31 AM, Helmut Jarausch wrote:

FuzzyQuery (in the 2.2.0 API) may take 3 arguments,
term , minimumSimilarity and prefixLength

Is there any syntax to specify the 3rd argument
in a query term for QueryParser?
(I haven't found any in the current docs)



No, there isn't.  But you can set it via the API; see
QueryParser#setFuzzyPrefixLength(int).


Erik
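For context on what the parameter does: prefixLength makes FuzzyQuery skip any candidate term that does not share the first N characters with the query term, before any edit distance is computed, which can speed up fuzzy queries considerably. A rough sketch of that filtering (the helper and term list are illustrative, not Lucene API):

```python
# Candidate terms that don't share the first prefix_length characters with
# the query term are skipped before any (expensive) edit distance is scored.

def fuzzy_candidates(query, terms, prefix_length):
    prefix = query[:prefix_length]
    return [t for t in terms if t.startswith(prefix)]

terms = ["damen", "dahmen", "daemon", "ramen", "lumen"]
print(fuzzy_candidates("damen", terms, 0))  # all five terms get scored
print(fuzzy_candidates("damen", terms, 2))  # ['damen', 'dahmen', 'daemon']
```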





Re: Query.rewrite - help me to understand it

2007-12-17 Thread Erik Hatcher


On Dec 17, 2007, at 5:14 AM, qvall wrote:

So does it mean that if my query doesn't support prefix or wildcard
queries, then I don't need to use rewrite() for highlighting?


As long as the terms you want highlighted are extractable from the  
Query instance, all is fine.


However, it wouldn't hurt to always rewrite.  Primitive queries
short-circuit the rewriting anyway, so it's not as though you're burning
much unnecessary time/IO in the rewrite call.


Erik
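Erik's distinction can be modeled in a few lines: a primitive TermQuery already carries its terms, while a multi-term query such as PrefixQuery only yields concrete, extractable terms once rewrite() expands it against the index's term dictionary. This is a toy model, not the actual Lucene classes:

```python
# Toy model of Query.rewrite(): primitive queries rewrite to themselves,
# multi-term queries expand into concrete terms the highlighter can extract.

class TermQuery:
    def __init__(self, term):
        self.term = term
    def rewrite(self, index_terms):
        return self                      # primitive: rewriting is a no-op
    def extract_terms(self):
        return {self.term}

class RewrittenQuery:
    def __init__(self, terms):
        self.terms = terms
    def extract_terms(self):
        return self.terms

class PrefixQuery:
    def __init__(self, prefix):
        self.prefix = prefix
    def rewrite(self, index_terms):
        # expand against the index's term dictionary
        return RewrittenQuery({t for t in index_terms if t.startswith(self.prefix)})
    def extract_terms(self):
        raise RuntimeError("no concrete terms until rewrite()")

index_terms = ["lucene", "lucid", "query", "rewrite"]
print(sorted(TermQuery("query").rewrite(index_terms).extract_terms()))  # ['query']
print(sorted(PrefixQuery("lu").rewrite(index_terms).extract_terms()))   # ['lucene', 'lucid']
```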





How to say Thank You ?

2007-12-17 Thread Helmut Jarausch
Hi,

I have got invaluable help from several people of this list.
Unfortunately I couldn't guess the email of some of you.

So, many thanks to all who have helped me.
Merry Christmas and a Happy New Year to you all.

(Perhaps someone comes up with a means to say 'thank you'
without 'polluting' the list for that and without
sacrificing privacy of the email addresses.)

Helmut.

-- 
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany




Re: How to say Thank You ?

2007-12-17 Thread Grant Ingersoll
I don't consider sending this kind of message to the list pollution.
It's good to take a step back from time to time and remember that
almost all of us volunteer here, even if we get paid to work w/
Lucene. I am constantly amazed at the Lucene community and what it
has to offer in the way of ideas, support and, for lack of a better
word, niceness. The sentiment is definitely appreciated and it is
good to know we have happy users.


FWIW, other ways to say thanks include:
1. Contributing docs, patches, help to others when possible.  See 
http://wiki.apache.org/lucene-java/HowToContribute
2.  The ASF gladly accepts donations, since it is a non-profit.  See 
http://www.apache.org/foundation/contributing.html

Of course, neither of these is a requirement for participation, but
since you asked, I thought I would offer.


Cheers,
Grant

On Dec 17, 2007, at 11:04 AM, Helmut Jarausch wrote:



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas

On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote:

I have a few fields that use package names and class names and I've been
looking for some suggestions for analyzing these fields.

A few examples -

Text (class name)
- "org.apache.lucene.document.Document"
Queries that would match
- "org.apache" , "org.apache.lucene.document"

Text (class name + method signature)
-- "org.apache.lucene.document.Document#add(Fieldable)"
Queries that would match
-- "org.apache.lucene", "org.apache.lucene.document.Document#add"

Any thoughts on how to approach tokenizing these types of texts?


Perhaps it would help to include some examples of queries you _don't_  
want to match.  For all the examples above, simply tokenizing  
alphanumeric components would suffice.


-Mike




RE: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Beyer,Nathan
Good point.

I don't want the sub-package names on their own to match.

Text (class name)
 - "org.apache.lucene.document.Document"
Queries that would match
 - "org.apache", "org.apache.lucene.document"
Queries that DO NOT match
 - "apache", "lucene", "document"

-Nathan


--
CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.




Phrase Query Problem

2007-12-17 Thread Sirish Vadala

I have the following code for search:

BooleanQuery bQuery = new BooleanQuery();
Query queryAuthor = new TermQuery(new Term(IFIELD_LEAD_AUTHOR,
        author.trim().toLowerCase()));
bQuery.add(queryAuthor, BooleanClause.Occur.MUST);

PhraseQuery pQuery = new PhraseQuery();
String[] phrase = txtWithPhrase.toLowerCase().split(" ");
for (int i = 0; i < phrase.length; i++) {
    pQuery.add(new Term(IFIELD_TEXT, phrase[i]));
}
pQuery.setSlop(0);
bQuery.add(pQuery, BooleanClause.Occur.MUST);

String[] sortOrder = {IFIELD_LEAD_AUTHOR, IFIELD_TEXT};
Sort sort = new Sort(sortOrder);
hits = indexSearcher.search(bQuery, sort);

Now my problem is: if I search on the phrase "Health Safety", it fetches
all records where the text contains "Health and Safety", "Health or
Safety", "Health in Safety", and so on. It fetches these records even
after setting the phrase query's slop to zero for an exact match. I am
using StandardAnalyzer while indexing my records.

Any help on this is greatly appreciated. 

Sirish Vadala
-- 
View this message in context: 
http://www.nabble.com/Phrase-Query-Problem-tp14373945p14373945.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





RE: Phrase Query Problem

2007-12-17 Thread Zhang, Lisheng
Hi,

Do you mean that your query phrase is "Health Safety",
but docs with "Health and Safety" are returned?

If that is the case, the reason is that StandardAnalyzer
filters out "and" (also "or", "in", and others) as stop
words during indexing, and the QueryParser filters those
words out as well.

Best regards, Lisheng
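A sketch of the effect: in this era of Lucene, the stop filter drops the word and, by default, leaves no positional gap, so the surviving tokens end up adjacent and an exact (slop 0) phrase still matches. The stop word list below is a hypothetical subset of the default list, and the leave_gaps flag models the position-increment behavior that later versions expose; this is an illustration, not Lucene API:

```python
# Simulate analysis of "Health and Safety": lowercase, drop stop words.
# Without positional gaps, "health" and "safety" land at adjacent positions,
# so a zero-slop phrase query for "health safety" matches the document.

STOP_WORDS = {"and", "or", "in", "the", "a", "of"}   # illustrative subset

def analyze(text, leave_gaps=False):
    """Return (token, position) pairs after lowercasing and stop removal."""
    out, pos = [], 0
    for word in text.lower().split():
        if word in STOP_WORDS:
            if leave_gaps:
                pos += 1   # preserve a hole where the stop word was
            continue
        out.append((word, pos))
        pos += 1
    return out

print(analyze("Health and Safety"))                   # [('health', 0), ('safety', 1)]
print(analyze("Health and Safety", leave_gaps=True))  # [('health', 0), ('safety', 2)]
```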




RE: FuzzyQuery + QueryParser - I'm puzzled

2007-12-17 Thread Steven A Rowe
Hi anjana m,

You're going to have lots of trouble getting a response, for two reasons:

1. You are replying to an existing thread and changing the subject.  Don't do 
that.  When you have a question, start a new thread by creating a new email 
instead of replying.

2. You are not telling the list what you have done and what you want to do.  
The information you provide tells us almost nothing, except that you have tried 
to use Lucene and failed.

We want to help - really.  But we can't unless you make your questions a) 
visible by not piggybacking on existing threads and b) clear by giving a full 
picture of what you want to do, what you have tried, and what happened.

Please, try again :).  Start by NOT replying to this message, but instead 
starting a new thread.

Steve

On 12/17/2007 at 6:24 AM, anjana m wrote:



Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas

Either index them as a series of tokens:

org
org.apache
org.apache.lucene
org.apache.lucene.document
org.apache.lucene.document.Document

or index them as a single token, and use prefix queries (this is what  
I do for reverse domain names):


classname:(org.apache org.apache.*)

Note that "classname:org.apache*" would probably be wrong--you might  
not want to match


org.apache-fake.lucene.document

regards,
-Mike
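The first option can be implemented with a simple token emitter: every dot-delimited prefix of the class name becomes its own token, so "org.apache" matches while the bare sub-package "apache" does not. A sketch, not a Lucene TokenStream implementation:

```python
# Emit every dot-delimited prefix of a dotted name as a token.
# Matching a query then reduces to an exact token lookup, which gives
# exactly the match/no-match behavior Nathan described.

def prefix_path_tokens(name):
    parts = name.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

tokens = prefix_path_tokens("org.apache.lucene.document.Document")
print(tokens)
# ['org', 'org.apache', 'org.apache.lucene',
#  'org.apache.lucene.document', 'org.apache.lucene.document.Document']
print("org.apache" in tokens)   # True  -> desired match
print("apache" in tokens)       # False -> bare sub-package doesn't match
```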

On 17-Dec-07, at 9:39 AM, Beyer,Nathan wrote:





Re: FuzzyQuery - rounding bug?

2007-12-17 Thread Erick Erickson
Please do not hijack the thread. When starting a new topic, do NOT
use "reply to"; start an entirely new e-mail. Otherwise your topic often
gets ignored by people who are uninterested in the original thread.

Best
Erick

On Dec 17, 2007 5:57 AM, anjana m <[EMAIL PROTECTED]> wrote:



RE: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Beyer,Nathan
Would using Field.Index.UN_TOKENIZED be the same as tokenizing a field
into one token?

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 17, 2007 12:53 PM
To: java-user@lucene.apache.org
Subject: Re: thoughts/suggestions for analyzing/tokenizing class names

Either index them as a series of tokens:

org
org.apache
org.apache.lucene
org.apache.lucene.document
org.apache.lucene.document.Document

or index them as a single token, and use prefix queries (this is what  
I do for reverse domain names):

classname:(org.apache org.apache.*)

Note that "classname:org.apache*" would probably be wrong--you might  
not want to match

org.apache-fake.lucene.document

regards,
-Mike

On 17-Dec-07, at 9:39 AM, Beyer,Nathan wrote:

> Good point.
>
> I don't want the sub-package names on their own to match.
>
> Text (class name)
>  - "org.apache.lucene.document.Document"
> Queries that would match
>  - "org.apache", "org.apache.lucene.document"
> Queries that DO NOT match
>  - "apache", "lucene", "document"
>
> -Nathan
>
> -Original Message-
> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 17, 2007 11:29 AM
> To: java-user@lucene.apache.org
> Subject: Re: thoughts/suggestions for analyzing/tokenizing class names
>
> On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote:
>
>> I have a few fields that use package names and class names and I've
>> been
>> looking for some suggestions for analyzing these fields.
>>
>> A few examples -
>>
>> Text (class name)
>> - "org.apache.lucene.document.Document"
>> Queries that would match
>> - "org.apache" , "org.apache.lucene.document"
>>
>> Text (class name + method signature)
>> -- "org.apache.lucene.document.Document#add(Fieldable)"
>> Queries that would match
>> -- "org.apache.lucene", "org.apache.lucene.document.Document#add"
>>
>> Any thoughts on how to approach tokenizing these types of texts?
>
> Perhaps it would help to include some examples of queries you _don't_
> want to match.  For all the examples above, simply tokenizing
> alphanumeric components would suffice.
>
> -Mike
>
> --
> CONFIDENTIALITY NOTICE This message and any included attachments  
> are from Cerner Corporation and are intended only for the  
> addressee. The information contained in this message is  
> confidential and may constitute inside or non-public information  
> under international, federal, or state securities laws.  
> Unauthorized forwarding, printing, copying, distribution, or use of  
> such information is strictly prohibited and may be unlawful. If you  
> are not the addressee, please promptly delete this message and  
> notify the sender of the delivery error by e-mail or you may call  
> Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1)  
> (816)221-1024.
>







RE: Phrase Query Problem

2007-12-17 Thread Zhang, Lisheng
Hi Sirish,

A few hours ago I sent a reply to your message. If my
understanding is correct, you indexed a doc with the text

Health and Safety

and you used phrase 

Health Safety

to create a phrase query. If that is the case, this is
normal: you used StandardAnalyzer to tokenize the
input text, so certain "stop words" are filtered out. The
complete list of filtered-out words is (from the Lucene 2.0
source code, StopAnalyzer):

###
  "a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
###

The end result is the same as if your input text were

Health Safety

Best regards, Lisheng
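The effect described above can be imitated with a plain stop-word filter; analyze below is an illustrative stand-in for what StandardAnalyzer does to this input, not the real tokenizer:

```python
# Stop words filtered by Lucene's StopAnalyzer (list quoted above).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
}

def analyze(text):
    """Lowercase, split on whitespace, drop stop words -- a rough
    approximation of StandardAnalyzer for this example."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

# Both inputs reduce to the same token stream, so a zero-slop phrase
# query for "Health Safety" can match a document containing
# "Health and Safety".
print(analyze("Health and Safety"))  # ['health', 'safety']
print(analyze("Health Safety"))      # ['health', 'safety']
```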


-Original Message-
From: Sirish Vadala [mailto:[EMAIL PROTECTED]
Sent: Monday, December 17, 2007 10:38 AM
To: java-user@lucene.apache.org
Subject: Phrase Query Problem



I have the following code for search:

BooleanQuery bQuery = new BooleanQuery();
Query queryAuthor;
queryAuthor = new TermQuery(new Term(IFIELD_LEAD_AUTHOR,
author.trim().toLowerCase()));
bQuery.add(queryAuthor, BooleanClause.Occur.MUST);



PhraseQuery pQuery = new PhraseQuery();
String[] phrase = txtWithPhrase.toLowerCase().split(" ");
for (int i = 0; i < phrase.length; i++) {
pQuery.add(new Term(IFIELD_TEXT, phrase[i]));
}
pQuery.setSlop(0);
bQuery.add(pQuery, BooleanClause.Occur.MUST);



String[] sortOrder = {IFIELD_LEAD_AUTHOR, IFIELD_TEXT};
Sort sort = new Sort(sortOrder);
hits = indexSearcher.search(bQuery, sort);

Now my problem is: if I do a search on the phrase "Health
Safety", it fetches all the records where the text is "Health
and/or/in Safety". It fetches these records even after setting the
slop of the phrase query to zero for an exact match. I am using
StandardAnalyzer while indexing my records.

Any help on this is greatly appreciated. 

Sirish Vadala
-- 
View this message in context:
http://www.nabble.com/Phrase-Query-Problem-tp14373945p14373945.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Analyzer to use with MultiSearcher using various indexes for multiple languages

2007-12-17 Thread Jay Hill
I'm working on a project where we are indexing content for several different
languages - English, Spanish, French and German. I have built separate
indexes for each language using the proper Analyzer for each language
(StandardAnalyzer for English, FrenchAnalyzer for French, etc.). We
have a requirement to search across multiple languages, so I'm planning to
use MultiSearcher, passing an array of all IndexSearchers for each language.
But I'm not sure how to set up an Analyzer for the IndexSearchers properly.
Any pointers about how to set up the MultiSearcher for use on indexes in
multiple languages would be appreciated.

-Jay


Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas


On 17-Dec-07, at 11:39 AM, Beyer,Nathan wrote:


Would using Field.Index.UN_TOKENIZED be the same as tokenizing a field
into one token?


Indeed.

-Mike






Infrastructure question

2007-12-17 Thread v k
Hello,

I am using Lucene to build an index from roughly 10 million documents.
The documents are about 4 TB in total.

After some trial runs indexing a subset of the documents, I am trying
to figure out a hosting-service configuration to create a full index
from the entire 10 TB of data. As I am still unsure how this project
will turn out, I am not purchasing hardware/RAM but am considering a
web host, for the purpose of:
1) downloading the data and starting to index it;
2) running the web front end to access this index, which will be a
Python framework (e.g. Django).

I am seriously contemplating signing up with Joyent for this plan:
AMD Opteron x64 multi-core servers with 4 GiB RAM per core,
1/16 (burstable up to 95%),
1 TB bandwidth/month, 1 GB RAM, plus as much NAS storage as I can
afford to pay for.

My QUESTION is: will this RAM and CPU be sufficient during
development of the search application, building the index, etc., or
is the hardware so under-equipped that the development version of my
application will not work?
I understand that having more RAM is always good, but is 1 GB as good as nothing?

This setup is NOT for production but for development, so I can get
my hands dirty with Lucene, which will require plenty of tweaks as the
project moves along.

What initial configuration would you recommend for a development
version, given the corpus size? I am not even sure how large my index
will be at this point.
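As a rough starting point, a rule of thumb often cited for Lucene is that an index (with fields indexed but not stored) runs somewhere around 20-35% of the raw text size; the ratios below are assumptions for illustration, and real numbers vary widely with stored fields, term vectors, and analysis:

```python
def estimate_index_size_gb(corpus_gb, low=0.20, high=0.35):
    """Back-of-envelope index-size range.  The 20-35% ratios are
    assumptions, not measured figures; treat the output as a rough
    bound for capacity planning only."""
    return corpus_gb * low, corpus_gb * high

# ~4 TB of raw documents, expressed in GB
low_gb, high_gb = estimate_index_size_gb(4 * 1024)
print(f"Estimated index size: {low_gb:.0f}-{high_gb:.0f} GB")
```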

I hope to build my indexes this way, and once the search
infrastructure is working and the web front end is complete, I plan to
worry about redundancy, availability, and scalability for the many
users I hope to provide this free service to. :-)

Many of you in this forum have built successful products with Lucene.
To name a few I am aware of: Ken Krugle, James Ryley, Dennis Kubes.

Some of you must have started with small machines, test set-ups, etc.,
where you built your initial search apps. I hope to receive some
advice about my plan and approach to building an infrastructure
to support my Lucene app.

Thank you.

Venkat
