Re: How to create a Lucene in-memory index at webapp deployment time

2012-09-10 Thread Jack Krupansky

Follow the instruction here:
http://lucene.apache.org/core/discussion.html

-- Jack Krupansky

-Original Message- 
From: Noopur Julka

Sent: Monday, September 10, 2012 12:43 PM
To: java-user@lucene.apache.org
Cc: Dhananjeyan Balaretnaraja
Subject: Re: How to create a Lucene in-memory index at webapp deployment 
time


Please unsubscribe me.

Regards,
Noopur Julka



On Fri, Sep 7, 2012 at 10:40 AM, Kasun Perera  wrote:


I have a Java/JSP web application running on an Apache Tomcat server. In this
web application I use Lucene to index and calculate similarity between some
PDF documents (the PDF documents are in the database). My live server doesn't
allow the web app to access files, so I have created the in-memory Lucene
index using the RAMDirectory class.

The way my application is currently coded, each time a user accesses the
Lucene functionality it creates a new in-memory index.

Is there any way to create the in-memory index at webapp deployment time, so
that the in-memory index is created only once and I can access it as long as
the web app is live?

--
Regards

Kasun Perera




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Hibernate Search with Regex based on Table

2012-09-12 Thread Jack Krupansky
It sounds as if MappingCharFilter would be sufficient. Unless there is some 
additional requirement?


In Solr we have:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    ...
  </analyzer>
</fieldType>


That mapping-ISOLatin1Accent.txt file maps or "folds" all the accented 
characters into the base ASCII letter.


-- Jack Krupansky

-Original Message- 
From: Robert Streitberger

Sent: Wednesday, September 12, 2012 8:45 AM
To: java-user@lucene.apache.org
Subject: Hibernate Search with Regex based on Table

Hello,

I am currently discussing the possibilities of introducing Hibernate
Search (Lucene) into an existing Java Web Project with existing Hibernate
Layer.

Hibernate queries are quite complex and mostly done with Criteria.

For certain properties/columns we are looking for advanced search
possibilities.

Example: assume we have a WHERE clause with a LIKE search looking up names
from different languages (we are on a UTF-8 database), say Gomez, which could
also be written as Gómez or Gômez, and so on.

The idea for the search is to have a table which provides all alternatives
for a certain letter, say o -> ô, ó, ò, ..., and to create a regex from this
that finds all possible spellings of Gomez no matter whether we use o or one
of its UTF-8 variants. The problem is that the regex can become very large,
as there are alternatives for nearly all vowels and consonants, and the
regexp_like search of the Oracle database is quite restricted.

Thus the idea would be to use some kind of indexed search with Lucene.

In short:
- Would it be possible to introduce Hibernate Search in the project? (We have
  at least Hibernate 3.0 and JDK 1.5 on Tomcat 6, with hbm.xml mapping files
  but no annotations.)
- Would it be possible to use an indexed Lucene search by adding Restrictions
  to Hibernate Criteria?
- Would it be possible to also introduce the matching table to create a
  complex regex?
- Or is there a restriction on the length of Lucene regex expressions?
- Or is there maybe another way which does not use regex at all, if regex is
  not possible at this complexity?


Many thanks in advance!
kr


Rob 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Hibernate Search with Regex based on Table

2012-09-12 Thread Jack Krupansky
MappingCharFilter  can do all of that. The file I referenced already has ae, 
oe, and ss. That default file handles your umlauts differently, but you can 
change the rules to suit your exact needs.
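
For illustration only (these particular rules are mine, not the shipped defaults), the mapping-file syntax looks like this:

# custom rules for German names
"ä" => "ae"
"ö" => "oe"
"ü" => "ue"
"ß" => "ss"
"ó" => "o"
"ô" => "o"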


-- Jack Krupansky

-Original Message- 
From: Robert Streitberger

Sent: Wednesday, September 12, 2012 9:22 AM
To: java-user@lucene.apache.org
Subject: Re: Hibernate Search with Regex based on Table

Hi,

thx for the hint. It seems to be an interesting solution.
Unfortunately I think it will come to problems with german names when
umlauts (ö, ä) and the sharp s (ß) are mapped, because there are some
requirements to map these chars to the usual german representation and
consider this in search. let's say oe, ae, ss.

kr
Rob



From:   "Jack Krupansky" 
To: 
Date:   12.09.2012 15:02
Subject:Re: Hibernate Search with Regex based on Table



It sounds as if MappingCharFilter would be sufficient. Unless there is
some
additional requirement?

In Solr we have:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    ...
  </analyzer>
</fieldType>


That mapping-ISOLatin1Accent.txt file maps or "folds" all the accented
characters into the base ASCII letter.

-- Jack Krupansky

-Original Message- 
From: Robert Streitberger

Sent: Wednesday, September 12, 2012 8:45 AM
To: java-user@lucene.apache.org
Subject: Hibernate Search with Regex based on Table

Hello,

I am currently discussing the possibilities of introducing Hibernate
Search (Lucene) into an existing Java Web Project with existing Hibernate
Layer.

Hibernate Queries are quite complex and mostly done with criteries.

For certain properties/columns we are looking for advanced search
possibilities.

Example: Assume we have a where clause with like search looking up for
names from different languages (we are on UTF-8 database) like let's say
Gomez -> which could also be written as Gómez or Gômez... what ever...

The idea for the search is to hava a table which provides all alternatives
for a certain letter... let's say o -> ô, ó, ò, ... and creating a regex
from this to find all possible combinations of Gomez no matter if we use
o, or variants of it from utf-8 character set. Problem is that regex can
be very large as there are alternatives for nearly any vocals and
consonants and regexp_like search of oracle database is quite restricted.

Thus idea would be to use some kind of index search with lucene.

In short: Would it be possible to introduce Hibernate Search in the
project? (There is at least hibernate 3.0 and Jdk 1.5 on tomcat 6 with
hbm.xml files available but not with annotations).
   Would it be possible to use indexed lucene search by
adding Restrictions to Hibernate Criterias?
   Would it be possible to also introduce the matching table
to create a complex regex?
  Or is there a restriction on the length of lucene regex
expressions?
 Or is there maybe another way which is not using regex at
all if regex is not possible with this complexity?


Many thanks in advance!
kr


Rob


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.0 GA time frame

2012-09-14 Thread Jack Krupansky
My personal estimate is that it will likely be within a week or two, but 
there is no official date.


-- Jack Krupansky

-Original Message- 
From: sausarkar

Sent: Friday, September 14, 2012 1:05 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 GA time frame

Now that the beta has been out for a while, any clue when the GA release will
take place?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lucene-4-0-GA-time-frame-tp3998264p4007802.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to fully preprocess query before fuzzy search?

2012-09-17 Thread Jack Krupansky

"
Lucene supports escaping special characters that are part of the query 
syntax. The current list special characters are


+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
"

See:
http://lucene.apache.org/core/4_0_0-ALPHA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html

So, maybe you should escape all special characters, and then add the fuzzy 
query. Note: In 4.0 the fuzzy query is limited to an editing distance of 2.
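
A rough sketch of that in Java, using the classic query parser's static escape helper (the Lucene.NET port has an equivalent Escape method, if I recall correctly):

// escape every query-syntax character first, then append the fuzzy operator
String escaped = QueryParser.escape(term);   // e.g. foo[1] becomes foo\[1\]
Query fuzzy = parser.parse(escaped + "~");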


-- Jack Krupansky

-Original Message- 
From: Ilya Zavorin

Sent: Monday, September 17, 2012 10:41 AM
To: java-user@lucene.apache.org
Subject: how to fully preprocess query before fuzzy search?

I am processing a bunch of text coming out of OCR, i.e. it's 
machine-generated text that contains some errors like garbage characters 
attached to words, letters replaced with similarly looking characters (e.g. 
"I" with "1") etc. The text is whitespace-tokenized and I am trying to match 
each token against an index using a fuzzy match, so that small amounts of 
occasional garbage in the tokens do not prevent a match.


Right now I am preprocessing each query as follows:

//term = token
Query queryF = parser.Parse(term.Replace("~", "") + "~");

However, searcher.Search still throws "can't parse" exceptions for queries 
that contain brackets, quotes and other garbage characters.


So how should I fully preprocess a query to avoid these exceptions?

It looks like I just need to remove a certain set of characters, just like the
tilde is removed above. What is the complete set of such characters? Do I
need to do any other preprocessing?


Thanks,

Ilya Zavorin


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to fully preprocess query before fuzzy search?

2012-09-17 Thread Jack Krupansky
Either is fine. In fact, just escape based on the individual character, not
the context. The multi-character context only tells you places where escaping
is not essential, but escaping there anyway doesn't hurt.


-- Jack Krupansky

-Original Message- 
From: Ilya Zavorin

Sent: Monday, September 17, 2012 11:08 AM
To: java-user@lucene.apache.org
Subject: RE: how to fully preprocess query before fuzzy search?

Thanks, so I do not need to escape the "&" in

"dog & cat"

But I do need to escape the "&&" in

"dog && cat"

correct? And do I escape as "dog \&& cat" or as "dog \&\& cat"?


Ilya


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, September 17, 2012 10:55 AM
To: java-user@lucene.apache.org
Subject: Re: how to fully preprocess query before fuzzy search?

"
Lucene supports escaping special characters that are part of the query 
syntax. The current list special characters are


+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
"

See:
http://lucene.apache.org/core/4_0_0-ALPHA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html

So, maybe you should escape all special characters, and then add the fuzzy 
query. Note: In 4.0 the fuzzy query is limited to an editing distance of 2.


-- Jack Krupansky

-Original Message-
From: Ilya Zavorin
Sent: Monday, September 17, 2012 10:41 AM
To: java-user@lucene.apache.org
Subject: how to fully preprocess query before fuzzy search?

I am processing a bunch of text coming out of OCR, i.e. it's 
machine-generated text that contains some errors like garbage characters 
attached to words, letters replaced with similarly looking characters (e.g.
"I" with "1") etc. The text is whitespace-tokenized and I am trying to match 
each token against an index using a fuzzy match, so that small amounts of 
occasional garbage in the tokens do not prevent a match.


Right now I am preprocessing each query as follows:

//term = token
Query queryF = parser.Parse(term.Replace("~", "") + "~");

However, searcher.Search still throws "can't parse" exceptions for queries 
that contain brackets, quotes and other garbage characters.


So how should I fully preprocess a query to avoid these exceptions?

Looks like I just need to remove a certain set of characters just like the 
tilde is removed above. What is the complete set of such characters? Do I 
need to do any other preprocess?


Thanks,

Ilya Zavorin


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to disable the field cache

2012-09-18 Thread Jack Krupansky
Could you suggest the code for a "mock field cache"? I mean, what would an
anonymous instance look like?


-- Jack Krupansky

-Original Message- 
From: karsten-s...@gmx.de

Sent: Tuesday, September 18, 2012 9:07 AM
To: java-user@lucene.apache.org
Subject: Re: how to disable the field cache

Hi 惠达 王,

if you do not sort (by field values) and do not use faceted search or joins,
the field cache will not be used.

If you want to be sure:
write a MockFieldCache and assign FieldCache.DEFAULT = new MockFieldCache();

Best regards
 Karsten

By the way, there are other objects in main memory, like document frequencies.

in context:
http://lucene.472066.n3.nabble.com/how-to-disable-the-field-cache-td4007384.html

 Original-Nachricht 

Datum: Thu, 13 Sep 2012 15:34:08 +0800
Von: "惠达 王" 
An: java-user@lucene.apache.org
Betreff: how to disable the field cache



Hi all

how to disable the field cache?


王惠达 (PHP开发工程师, Sysdev Team)
-
分机:8836
QQ:429335915
mobile: 13795449454
E-mail: williamw...@anjuke.com
上海市浦东新区陆家嘴环路166号未来资产大厦10楼




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using stop words with snowball analyzer and shingle filter

2012-09-19 Thread Jack Krupansky
The underscores are due to the fact that the StopFilter defaults to "enable 
position increments", so there are no terms at the positions where the stop 
words appeared in the source text.


Unfortunately, SnowballAnalyzer does not pass that in as a parameter and is 
"final", so you can't subclass it to override the "createComponents" method 
that creates the StopFilter. You would essentially have to copy the source 
for SnowballAnalyzer and then add code to invoke 
StopFilter.setEnablePositionIncrements the way StopFilterFactory does.
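
Something along these lines, untested and written against the Lucene 3.0 API you are using (the class name is mine; adjust the constructors for other 3.x releases):

import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Same chain as SnowballAnalyzer (tokenize, standard filter, lower case, stop, stem),
// but with position increments disabled in the StopFilter so removed stop words
// do not leave holes that ShingleFilter fills with "_".
public final class NoHoleSnowballAnalyzer extends Analyzer {
  private final Version matchVersion;
  private final String stemmerName;
  private final Set<?> stopWords;

  public NoHoleSnowballAnalyzer(Version matchVersion, String stemmerName, Set<?> stopWords) {
    this.matchVersion = matchVersion;
    this.stemmerName = stemmerName;
    this.stopWords = stopWords;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(matchVersion, reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    // first argument false = do not record position gaps for removed stop words
    // (on later 3.x releases, use the Version constructor and call setEnablePositionIncrements(false))
    result = new StopFilter(false, result, stopWords);
    return new SnowballFilter(result, stemmerName);
  }
}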


-- Jack Krupansky

-Original Message- 
From: Martin O'Shea

Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for ngram frequencies. Typically, this is done as
follows:



snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English",
stopWords);

shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer,
this.getnGramLength());



stopWords is set to either a full list of words to include in ngrams or a
list of words to remove from them. this.getnGramLength() simply returns the
current ngram length, up to a maximum of three.



If I use stopwords in filtering text "satellite is definitely falling to
Earth" for trigrams, the output is:



No=1, Key=to, Freq=1

No=2, Key=definitely, Freq=1

No=3, Key=falling to earth, Freq=1

No=4, Key=satellite, Freq=1

No=5, Key=is, Freq=1

No=6, Key=definitely falling to, Freq=1

No=7, Key=definitely falling, Freq=1

No=8, Key=falling, Freq=1

No=9, Key=to earth, Freq=1

No=10, Key=satellite is, Freq=1

No=11, Key=is definitely, Freq=1

No=12, Key=falling to, Freq=1

No=13, Key=is definitely falling, Freq=1

No=14, Key=earth, Freq=1

No=15, Key=satellite is definitely, Freq=1



But if I don't use stopwords for trigrams, the output is this:



No=1, Key=satellite, Freq=1

No=2, Key=falling _, Freq=1

No=3, Key=satellite _ _, Freq=1

No=4, Key=_ earth, Freq=1

No=5, Key=falling, Freq=1

No=6, Key=satellite _, Freq=1

No=7, Key=_ _, Freq=1

No=8, Key=_ falling _, Freq=1

No=9, Key=falling _ earth, Freq=1

No=10, Key=_, Freq=3

No=11, Key=earth, Freq=1

No=12, Key=_ _ falling, Freq=1

No=13, Key=_ falling, Freq=1



Why am I seeing underscores? I would have expected to see simple unigrams plus
"satellite falling", "falling earth", and "satellite falling earth".








-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for a search string containing a literal slash doesn't work with QueryParser

2012-10-01 Thread Jack Krupansky
The scape merely assures that the slash will not be parsed as query syntax 
and will be passed directly to the analyzer, but the standard analyzer will 
in fact always remove it. Maybe you want the white space analyzer or keyword 
analyzer (no characters removed.)


-- Jack Krupansky

-Original Message- 
From: Jochen Hebbrecht

Sent: Monday, October 01, 2012 8:59 AM
To: java-user@lucene.apache.org
Subject: Searching for a search string containing a literal slash doesn't 
work with QueryParser


Hi,

I'm currently trying to search on the following search string in my Lucene
index: "2012/0.124.323".
The Java code I use to search ('value' is my search string):


QueryParser queryParser = new QueryParser(Version.LUCENE_36, field, new
StandardAnalyzer(Version.LUCENE_36));
queryParser.setAllowLeadingWildcard(true);
return queryParser.parse(value);


This results in the query "2012" "0.124.323": QueryParser is replacing the
forward slash with a space.
I tried escaping the "/" with a backslash "\", but that doesn't work either.

Maybe this helps to fully understand my scenario. I have the following import
XML:

---
...
<TEXT>Vervaldag</TEXT>
<TEXT>17/07/12</TEXT>
<TEXT>09/07/12</TEXT>
<TEXT>2012/0.124.323</TEXT>
<TEXT>Kapitaals</TEXT>
...
---

I get all TEXT values with an XPath expression and I index them as:

---
XPathExpression expr = xpath.compile("//TEXT");
Object result = expr.evaluate(document, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
   doc.add(new org.apache.lucene.document.Field("IMAGE",
nodes.item(i).getFirstChild().getNodeValue(), Store.NO, Index.ANALYZED));
}
---

I'm using the StandardAnalyzer.

What is the best way to solve my issue? Do I need to switch analyzers?
Do I have to use something other than QueryParser? ...
I also want to support searching on 2012/0.*, so I cannot just use
TermQuery ...

Kind regards,
Jochen 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for a search string containing a literal slash doesn't work with QueryParser

2012-10-01 Thread Jack Krupansky

That's "The escape merely..."

-- Jack Krupansky

-Original Message----- 
From: Jack Krupansky

Sent: Monday, October 01, 2012 9:58 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for a search string containing a literal slash 
doesn't work with QueryParser


The scape merely assures that the slash will not be parsed as query syntax
and will be passed directly to the analyzer, but the standard analyzer will
in fact always remove it. Maybe you want the white space analyzer or keyword
analyzer (no characters removed.)

-- Jack Krupansky

-Original Message- 
From: Jochen Hebbrecht

Sent: Monday, October 01, 2012 8:59 AM
To: java-user@lucene.apache.org
Subject: Searching for a search string containing a literal slash doesn't
work with QueryParser

Hi,

I'm currently trying to search on the following search string in my Lucene
index: "2012/0.124.323".
The java code to search for ('value' is my search string)


QueryParser queryParser = new QueryParser(Version.LUCENE_36, field, new
StandardAnalyzer(Version.LUCENE_36));
queryParser.setAllowLeadingWildcard(true);
return queryParser.parse(value);


This returns a query result: "2012" "0.124.323". QueryParser is replacing
the forward slash by a space.
I tried escaping the "/" with a backslash "\", but this doesn't work either.

Maybe required to fully understand my scenario. I have the following import
XML:

---
...
Vervaldag 
17/07/12
09/07/12
2012/0.124.323
Kapitaals
...
---

I get all TEXT values with an XPath expression and I index them as:

---
XPathExpression expr = xpath.compile("//TEXT");
Object result = expr.evaluate(document, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
   doc.add(new org.apache.lucene.document.Field("IMAGE",
nodes.item(i).getFirstChild().getNodeValue(), Store.NO, Index.ANALYZED));
}
---

I'm using the StandardAnalyzer.

What is the best way to solve my issue? Do I need to switch from Analyzer?
Do I have to use something else then QueryParser? ...
I also want to support searching on 2012/0.*, so I cannot only use
TermQuery ...

Kind regards,
Jochen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for a search string containing a literal slash doesn't work with QueryParser

2012-10-01 Thread Jack Krupansky
You can apply the lower case filter to the whitespace or other analyzer and 
use that as the analyzer.


-- Jack Krupansky

-Original Message- 
From: Jochen Hebbrecht

Sent: Monday, October 01, 2012 10:34 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for a search string containing a literal slash 
doesn't work with QueryParser


Hi Jack,

I tried analyzing through WhitespaceAnalyzer. Now I can search on my query
string AND I can find my document! Great!
But all my searches are now case-sensitive. So when I index a field as
"JavaOne", I also have to enter the search word as "JavaOne" and not
"javaone" or "javaOne".

How do you solve this in a proper way? By bringing all characters
toLowerCase() when indexing them?

Jochen


2012/10/1 Jack Krupansky 


That's "The escape merely..."

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Monday, October 01, 2012 9:58 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for a search string containing a literal slash
doesn't work with QueryParser


The scape merely assures that the slash will not be parsed as query syntax
and will be passed directly to the analyzer, but the standard analyzer 
will

in fact always remove it. Maybe you want the white space analyzer or
keyword
analyzer (no characters removed.)

-- Jack Krupansky

-Original Message- From: Jochen Hebbrecht
Sent: Monday, October 01, 2012 8:59 AM
To: java-user@lucene.apache.org
Subject: Searching for a search string containing a literal slash doesn't
work with QueryParser

Hi,

I'm currently trying to search on the following search string in my Lucene
index: "2012/0.124.323".
The java code to search for ('value' is my search string)


QueryParser queryParser = new QueryParser(Version.LUCENE_36, field, new
StandardAnalyzer(Version.LUCENE_36));
queryParser.setAllowLeadingWildcard(true);
return queryParser.parse(value);


This returns a query result: "2012" "0.124.323". QueryParser is replacing
the forward slash by a space.
I tried escaping the "/" with a backslash "\", but this doesn't work
either.

Maybe required to fully understand my scenario. I have the following 
import

XML:

---
...
Vervaldag 
17/07/12
09/07/12
2012/0.124.323
Kapitaals
...
---

I get all TEXT values with an XPath expression and I index them as:

---
XPathExpression expr = xpath.compile("//TEXT");
Object result = expr.evaluate(document, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
   doc.add(new org.apache.lucene.document.Field("IMAGE",
nodes.item(i).getFirstChild().getNodeValue(), Store.NO,
Index.ANALYZED));
}
---

I'm using the StandardAnalyzer.

What is the best way to solve my issue? Do I need to switch from Analyzer?
Do I have to use something else then QueryParser? ...
I also want to support searching on 2012/0.*, so I cannot only use
TermQuery ...

Kind regards,
Jochen


-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for a search string containing a literal slash doesn't work with QueryParser

2012-10-01 Thread Jack Krupansky

Sorry, I meant apply the filter to the TOKENIZER that the analyzer uses.
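
Roughly like this, untested, against the 3.6 API (the class name is mine); use the same analyzer for both indexing and query parsing so index-time and query-time terms agree:

import java.io.Reader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Whitespace tokenization keeps "2012/0.124.323" as one token; the lower case
// filter on top of the tokenizer makes "JavaOne" and "javaone" match.
public final class LowercaseWhitespaceAnalyzer extends ReusableAnalyzerBase {
  private final Version matchVersion;

  public LowercaseWhitespaceAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
    return new TokenStreamComponents(source, new LowerCaseFilter(matchVersion, source));
  }
}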

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Monday, October 01, 2012 10:44 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for a search string containing a literal slash 
doesn't work with QueryParser


You can apply the lower case filter to the whitespace or other analyzer and
use that as the analyzer.

-- Jack Krupansky

-Original Message- 
From: Jochen Hebbrecht

Sent: Monday, October 01, 2012 10:34 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for a search string containing a literal slash
doesn't work with QueryParser

Hi Jack,

I tried analyzing through WhitespaceAnalyzer. Now I can search on my query
string AND I can find my document! Great!
But all my searches are now case sensitive. So when I index a field as
"JavaOne", I also have to enter in my search word: "JavaOne" and not
"javaone" or "javaOne".

How do you solve this in a proper way? Bringing all characters
toLowerCase() when indexing them?

Jochen


2012/10/1 Jack Krupansky 


That's "The escape merely..."

-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Monday, October 01, 2012 9:58 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for a search string containing a literal slash
doesn't work with QueryParser


The scape merely assures that the slash will not be parsed as query syntax
and will be passed directly to the analyzer, but the standard analyzer 
will

in fact always remove it. Maybe you want the white space analyzer or
keyword
analyzer (no characters removed.)

-- Jack Krupansky

-Original Message- From: Jochen Hebbrecht
Sent: Monday, October 01, 2012 8:59 AM
To: java-user@lucene.apache.org
Subject: Searching for a search string containing a literal slash doesn't
work with QueryParser

Hi,

I'm currently trying to search on the following search string in my Lucene
index: "2012/0.124.323".
The java code to search for ('value' is my search string)


QueryParser queryParser = new QueryParser(Version.LUCENE_36, field, new
StandardAnalyzer(Version.LUCENE_36));
queryParser.setAllowLeadingWildcard(true);
return queryParser.parse(value);


This returns a query result: "2012" "0.124.323". QueryParser is replacing
the forward slash by a space.
I tried escaping the "/" with a backslash "\", but this doesn't work
either.

Maybe required to fully understand my scenario. I have the following 
import

XML:

---
...
Vervaldag 
17/07/12
09/07/12
2012/0.124.323
Kapitaals
...
---

I get all TEXT values with an XPath expression and I index them as:

---
XPathExpression expr = xpath.compile("//TEXT");
Object result = expr.evaluate(document, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
   doc.add(new org.apache.lucene.document.Field("IMAGE",
nodes.item(i).getFirstChild().getNodeValue(), Store.NO,
Index.ANALYZED));
}
---

I'm using the StandardAnalyzer.

What is the best way to solve my issue? Do I need to switch from Analyzer?
Do I have to use something else then QueryParser? ...
I also want to support searching on 2012/0.*, so I cannot only use
TermQuery ...

Kind regards,
Jochen


-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: understanding the need to reindex a document

2012-10-22 Thread Jack Krupansky
You can in fact add fields. And you can remove the value of a field for a 
single document. Both can be done by simply re-adding the document with the 
same unique key value. But that won't add that field to other or all 
documents, or remove the value of that field for all other documents.


Re-indexing is usually required when you change the type of a field, its 
attributes, or its analyzer.


The primary difficulty with automatically re-indexing from the index itself 
is that Lucene does not require all fields to be "stored". So, replacing a 
document means you need to have an external source for all non-stored 
fields, or at least non-stored fields which are not copied from another 
field which is stored.
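
For what it's worth, "re-adding" looks roughly like this (hypothetical field names, 3.x-style Field API); every field value has to be supplied again, because the old document is deleted and a new one added rather than edited in place:

Document doc = new Document();
doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));   // unique key
doc.add(new Field("title", newTitle, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));    // non-stored: needs an external source
writer.updateDocument(new Term("id", id), doc);   // delete-then-add by unique key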


-- Jack Krupansky

-Original Message- 
From: Shaya Potter

Sent: Monday, October 22, 2012 3:47 PM
To: java-user@lucene.apache.org
Subject: Re: understanding the need to reindex a document

I can understand that (i.e., why in-place updates won't work), but the
question would be: why can't one read in a document, add() or removeField()
on that document, and then updateDocument() it?

On 10/22/2012 03:43 PM, Apostolis Xekoukoulotakis wrote:

I am not that familiar with Lucene, so my answer may be a bit off.

Search the internet for log-structured storage. There you will find why
rewriting an entry is better than updating an existing entry.

LevelDB/Cassandra/BigTable use it; maybe search for these terms as well.

2012/10/22 Shaya Potter 


There are lots of questions asked about wanting to modify a Lucene document
(i.e. remove fields, add fields) where the answer is that one needs to
reindex.

No one ever answers the technical question of why this is, and I'm interested
in that. Presumably it is because documents aren't stored as documents, or
even in any form that could be reassembled into a document, but I'm
interested in what the actual answer is.

thanks.

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org








-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardAnalyzer functionality change

2012-10-24 Thread Jack Krupansky
Yes, by design. StandardAnalyzer implements "simple word boundaries" (the 
technical term is "Unicode text segmentation"), period. As the javadoc says, 
"As of Lucene version 3.1, this class implements the Word Break rules from 
the Unicode Text Segmentation algorithm, as specified in Unicode Standard 
Annex #29." That is a "standard".


See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html

-- Jack Krupansky

-Original Message- 
From: kiwi clive

Sent: Wednesday, October 24, 2012 6:42 AM
To: java-user@lucene.apache.org
Subject: StandardAnalyzer functionality change

Hi all,

Sorry if I'm asking an age-old question, but we have migrated to Lucene 3.6.0 
and I see StandardAnalyzer has changed its behaviour, particularly when 
tokenizing email addresses. From reading the forums, I understand 
StandardAnalyzer was renamed to ClassicAnalyzer - is this the case?



If I pass the string 'u...@domain.com' through these analyzers, I get the 
following tokens:


Using StandardAnalyzer(Version.LUCENE_23):  -->  u...@domain.com (one token)

Using StandardAnalyzer(Version.LUCENE_36):  -->  user domain.com(two 
tokens)
Using ClassicAnalyzer(Version.LUCENE_36): -->  u...@domain.com  (one 
token)


StandardAnalyzer is normally a good compromise as a default analyzer, but the 
failure to keep an email address intact makes it less fit for purpose than 
it used to be. Is this a bug or is it by design? If by design, what is the 
reason for the change, and is ClassicAnalyzer now the de facto analyzer to 
use?


Thanks,
Clive 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardAnalyzer functionality change

2012-10-24 Thread Jack Krupansky
I didn't explicitly say it, but ClassicAnalyzer does do exactly what you 
want it to do - work break plus email and URL, or StandardAnalyzer plus 
email and URL.
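
In code, either of these keeps an email address as a single token (a quick sketch against Lucene 3.6):

// pre-3.1 StandardAnalyzer behavior
Analyzer classic = new ClassicAnalyzer(Version.LUCENE_36);
// Unicode word-break rules (UAX #29) plus email and URL recognition
Analyzer emailAware = new UAX29URLEmailAnalyzer(Version.LUCENE_36);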


-- Jack Krupansky

-Original Message- 
From: kiwi clive

Sent: Wednesday, October 24, 2012 1:27 PM
To: java-user@lucene.apache.org
Subject: Re: StandardAnalyzer functionality change

Thanks for the responses chaps, very informative, and most appreciated :-)






From: Ian Lea 
To: java-user@lucene.apache.org
Sent: Wednesday, October 24, 2012 4:19 PM
Subject: Re: StandardAnalyzer functionality change

If you want email addresses, UAX29URLEmailAnalyzer is another alternative.


--
Ian.


On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky  
wrote:

Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
technical term is "Unicode text segmentation"), period. As the javadoc 
says,

"As of Lucene version 3.1, this class implements the Word Break rules from
the Unicode Text Segmentation algorithm, as specified in Unicode Standard
Annex #29." That is a "standard".

See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html

-- Jack Krupansky

-Original Message- From: kiwi clive
Sent: Wednesday, October 24, 2012 6:42 AM
To: java-user@lucene.apache.org
Subject: StandardAnalyzer functionality change


Hi all,

Sorry if I'm asking an age old question but we have migrated to lucene 
3.6.0

and I see StandardAnalyzer has changed its behaviour, particularly when
tokenizing email addresses. From reading the forums, I understand
StandardAnalyzer was renamed to ClassicAnalyzer - is this the case ?


If I pass the string 'u...@domain.com' through these analyzers, I get the
following tokens:

Using StandardAnalyzer(Version.LUCENE_23):  -->  u...@domain.com (one 
token)


Using StandardAnalyzer(Version.LUCENE_36):  -->  user domain.com(two
tokens)
Using ClassicAnalyzer(Version.LUCENE_36): -->  u...@domain.com  (one
token)

StandardAnalyzer is normally a good compromise as a default analyzer but 
the

failure to keep an email address intact makes it less fit for purpose than
it used to be. Is this a bug or is it by design ?  If by design, what is 
the
reason for the change and is ClassicAnalyzer now the defacto-analyzer to 
use

?

Thanks,
Clive

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StandardAnalyzer functionality change

2012-10-24 Thread Jack Krupansky

s/work break/word break/

-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, October 24, 2012 3:52 PM
To: java-user@lucene.apache.org ; kiwi clive
Subject: Re: StandardAnalyzer functionality change

I didn't explicitly say it, but ClassicAnalyzer does do exactly what you
want it to do - work break plus email and URL, or StandardAnalyzer plus
email and URL.

-- Jack Krupansky

-Original Message- 
From: kiwi clive

Sent: Wednesday, October 24, 2012 1:27 PM
To: java-user@lucene.apache.org
Subject: Re: StandardAnalyzer functionality change

Thanks for the responses chaps, very informative, and most appreciated :-)






From: Ian Lea 
To: java-user@lucene.apache.org
Sent: Wednesday, October 24, 2012 4:19 PM
Subject: Re: StandardAnalyzer functionality change

If you want email addresses, UAX29URLEmailAnalyzer is another alternative.


--
Ian.


On Wed, Oct 24, 2012 at 3:56 PM, Jack Krupansky 
wrote:

Yes, by design. StandardAnalyzer implements "simple word boundaries" (the
technical term is "Unicode text segmentation"), period. As the javadoc 
says,

"As of Lucene version 3.1, this class implements the Word Break rules from
the Unicode Text Segmentation algorithm, as specified in Unicode Standard
Annex #29." That is a "standard".

See:
http://lucene.apache.org/core/4_0_0-ALPHA/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html

-- Jack Krupansky

-Original Message- From: kiwi clive
Sent: Wednesday, October 24, 2012 6:42 AM
To: java-user@lucene.apache.org
Subject: StandardAnalyzer functionality change


Hi all,

Sorry if I'm asking an age old question but we have migrated to lucene 
3.6.0

and I see StandardAnalyzer has changed its behaviour, particularly when
tokenizing email addresses. From reading the forums, I understand
StandardAnalyzer was renamed to ClassicAnalyzer - is this the case ?


If I pass the string 'u...@domain.com' through these analyzers, I get the
following tokens:

Using StandardAnalyzer(Version.LUCENE_23):  -->  u...@domain.com (one 
token)


Using StandardAnalyzer(Version.LUCENE_36):  -->  user domain.com(two
tokens)
Using ClassicAnalyzer(Version.LUCENE_36): -->  u...@domain.com  (one
token)

StandardAnalyzer is normally a good compromise as a default analyzer but 
the

failure to keep an email address intact makes it less fit for purpose than
it used to be. Is this a bug or is it by design ?  If by design, what is 
the
reason for the change and is ClassicAnalyzer now the defacto-analyzer to 
use

?

Thanks,
Clive

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there anything in Lucene 4.0 that provides 'absolute' scoring so that i can compare the scoring results of different searches ?

2012-10-25 Thread Jack Krupansky
Could you provide a more concrete definition of what you mean by "absolute 
scoring"? I mean, you can implement your own scoring or "similarity", so 
what exact criteria are you proposing?


Try providing a concise example of a couple of documents and a couple of 
searches and how you propose to score them.


-- Jack Krupansky

-Original Message- 
From: Paul Taylor

Sent: Thursday, October 25, 2012 7:11 AM
To: java-user@lucene.apache.org
Subject: Is there anything in Lucene 4.0 that provides 'absolute' scoring so 
that i can compare the scoring results of different searches ?


Is there anything in Lucene 4.0 that provides 'absolute' scoring so that
i can compare the scoring results of different searches ?

To explain: if I do a search for two values, fred OR jane, and there is a
document that contains both those words exactly, then that document will
score 100 and documents that contain only one word will score less. But if
there is no document containing both words, and there is one document that
contains fred, then that document will score 100 even though it didn't match
jane at all. (I'm clearly ignoring all the complexities, but you get my
gist.)

So all documents returned from a search are scored relative to each other,
but I cannot perform a second search and sensibly compare its scores with
those of the first search, which is what I would like to do.

Why would I want to this ?

In a music database I have an index of releases and a separate index of
artists; usually the user just searches artists or releases. But sometimes
they want to search everything and interleave the results from the two
indexes, and it's not sensible for me to interleave them based on their score
at the moment.

thanks Paul



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: App supplied docID in lucene possible?

2012-10-25 Thread Jack Krupansky
Have you looked at or decided against an approach like Solr's 
ExternalFileField?


See:
http://lucene.apache.org/solr/4_0_0/solr-core/org/apache/solr/schema/ExternalFileField.html

Is that at least the kind of issue you are trying to deal with?

One final question: how many of a document's field values are stable vs. 
frequently changing? What are the numbers here - total field count, count of 
frequently changed fields, and percentage of documents being updated in some 
period of time?


And, I don't quite follow why you can't just use a unique key for a document 
rather than the low-level Lucene document id.


-- Jack Krupansky

-Original Message- 
From: Ravikumar Govindarajan

Sent: Thursday, October 25, 2012 6:10 AM
To: java-user@lucene.apache.org
Subject: App supplied docID in lucene possible?

We have the need to re-index some fields in our application frequently.

Our typical document consists of

a) Many single-valued {long/int} re-indexable fields
b) Few large-valued {text/string} static fields

We have to re-index an entire document if a single smallish field changes
and it is turning out to be a problem for us. I have gone through the
https://issues.apache.org/jira/browse/LUCENE-3837 proposal where it tries
to work-around this limitation using a secondary mapping of new-old docids.

As I understand, lucene strictly maintains internal doc-id order so that
many queries that depend on it, will work correctly. Segment merges will
also maintain order as well as reclaim deleted doc-ids

There should be many applications like us, which manage index shards
limiting a given shard based on doc-id limits or size. So reclaiming
deleted doc-ids is mostly a non-issue for us.

That leaves us with changing doc-ids. How about leaving open the doc-ids
themselves to the applications, at-least as an option to the needy? Taking
such an approach might inter-leave doc-ids across segments, but within a
segment, the docIds are always in increasing order. There are possibilities
of ghost-deletes, duplicate docIds etc..., but all should be solvable, I
believe.

Fronting these doc-ids during search from all segment readers and returning
the correct value from one of them should be easy. Will it incur a heavy
penalty during search? Another advantage gained, is the triviality of
cross-joining indexes when docIDs are fixed.

There must be many other places where an app supplied docId might make
lucene behave funny. Need some help in identifying those areas at least for
understanding this problem correctly, if not solving it all together.

--
Ravi 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to use/create an alias to a field?

2012-10-25 Thread Jack Krupansky

With edismax in Solr 3.6/4.0 field aliases are supported:

"The syntax for aliasing is f.myalias.qf=realfield. A user query for 
myalias:foo will be queried as realfield:foo."


See:
http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2BAC8_renaming
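
For your "au"/"auth"/"author" example the request parameters would look something like this (a sketch, not tested):

defType=edismax&f.au.qf=author&f.auth.qf=author&q=au:some_name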

-- Jack Krupansky

-Original Message- 
From: Willi Haase

Sent: Thursday, October 25, 2012 9:26 AM
To: java-user@lucene.apache.org
Subject: How to use/create an alias to a field?

Hello

I am using Lucene 3.4 and I have nearly the same question as Jan in his 
post:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200801.mbox/%3C4791E482.60900%40gmx.de%3E


I couldn't find anything helpful in "Lucene in Action" or on the web.
What is the recommended way to create an alias to a field?
I want to be able to use "au:some_name", "auth:some_name", or
"author:some_name" in the search query, all using the field "author".


Many thanks in advance for any help or recommendations
 Willi : ) 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to use/create an alias to a field?

2012-10-25 Thread Jack Krupansky
I almost added that you could create a subclass of the Lucene query parser, 
as Solr does, and add the aliasing that way. There might/should be field 
aliasing code in Solr that you could easily reuse in Lucene. There really 
isn't a great reason why aliasing is only available in Solr and not in 
Lucene.


But unless you are prepared to hack on the query parser, alias support isn't 
available in Lucene right now.
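
If you do want to hack on it, here is a rough, untested sketch against the 3.x classic QueryParser (the class name and alias map are mine, and the other getXxxQuery hooks - prefix, fuzzy, wildcard, range - would need the same mapping):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class AliasingQueryParser extends QueryParser {
  // alias -> real field name
  private static final Map<String, String> ALIASES = new HashMap<String, String>();
  static {
    ALIASES.put("au", "author");
    ALIASES.put("auth", "author");
  }

  public AliasingQueryParser(Version matchVersion, String defaultField, Analyzer analyzer) {
    super(matchVersion, defaultField, analyzer);
  }

  private static String resolve(String field) {
    String real = ALIASES.get(field);
    return real != null ? real : field;
  }

  // Rewrite the field name before the normal term/phrase handling runs.
  @Override
  protected Query getFieldQuery(String field, String queryText, boolean quoted) throws ParseException {
    return super.getFieldQuery(resolve(field), queryText, quoted);
  }
}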


-- Jack Krupansky
-Original Message- 
From: Willi Haase

Sent: Thursday, October 25, 2012 11:59 AM
To: java-user@lucene.apache.org
Subject: Re: How to use/create an alias to a field?

Hi Jack

Thank you for your help.
My problem is, I have only a Lucene setup and can not switch to Solr : (

Cheers
 Willi



Von: Jack Krupansky 
An: java-user@lucene.apache.org
Gesendet: 15:57 Donnerstag, 25.Oktober 2012
Betreff: Re: How to use/create an alias to a field?

With edismax in Solr 3.6/4.0 field aliases are supported:

"The syntax for aliasing is f.myalias.qf=realfield. A user query for 
myalias:foo will be queried as realfield:foo."


See:
http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2BAC8_renaming

-- Jack Krupansky

-Original Message- From: Willi Haase
Sent: Thursday, October 25, 2012 9:26 AM
To: java-user@lucene.apache.org
Subject: How to use/create an alias to a field?

Hello

I am using Lucene 3.4 and I have nearly the same question like Jan in his 
post:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200801.mbox/%3C4791E482.60900%40gmx.de%3E


I could found something helpful in "Lucene in Action" or in the web.
What is the recommended way to create an alias to a field?
I want be able to use in the searchquery "au:some_name" or "auth:some_name" 
or "author:some_name", all using the field "author".


Many thanks in advance for any help or recommendations
Willi : )

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: query for documents WITHOUT a field?

2012-10-25 Thread Jack Krupansky
"OR allergies IS NULL" would be "OR (*:* -allergies:[* TO *])" in 
Lucene/Solr.


-- Jack Krupansky

-Original Message- 
From: Vitaly Funstein

Sent: Thursday, October 25, 2012 8:25 PM
To: java-user@lucene.apache.org
Subject: Re: query for documents WITHOUT a field?

Sorry for resurrecting an old thread, but how would one go about writing a
Lucene query similar to this?

SELECT * FROM patient WHERE first_name = 'Zed' OR allergies IS NULL

An AND case would be easy since one would just use a simple TermQuery with
a FieldValueFilter added, but what about other boolean cases? Admittedly,
this is a contrived example, but the point here is that it seems that since
filters are always applied to results after they are returned, how would
one go about making the null-ness of a field part of the query logic?

On Thu, Feb 16, 2012 at 1:45 PM, Uwe Schindler  wrote:


I already mentioned that pseudo NULL term, but the user asked for another
solution...
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de



Jamie Johnson  schrieb:

Another possible solution is while indexing insert a custom token
which is impossible to show up in the index otherwise, then do the
filter based on that token.


On Thu, Feb 16, 2012 at 4:41 PM, Uwe Schindler  wrote:
> As the documentation states:
> Lucene is an inverted index that does not have per-document fields. It
only
> knows terms pointing to documents. The query you are searching is a 
> query

> that returns all documents which have no term. To execute this query, it
> will get the term index and iterate all terms of a field, mark those in 
> a

> bitset and negates that. The filter/query I told you uses the FieldCache
to
> do this. Since 3.6 (also in 3.5, but there it is buggy/API different)
there
> is another fieldcache that returns exactly that bitset. The filter
mentioned
> only uses that bitset from this new fieldcache. Fieldcache is populated
on
> first access and keeps alive as long as the underlying index segment is
open
> (means as long as IndexReader is open and the parts of the index is not
> refreshed). If you are also sorting against your fields or doing other
> queries using FieldCache, there is no overhead, otherwise the bitset is
> populated on first access to the filter.
>
> Lucene 3.5 has no easy way to implement that filter, a "NULL" pseudo
term is
> the only solution (and also much faster on the first access in Lucene
3.6).
> Later accesses hitting the cache in 3.6 will be faster, of course.
>
> Another hacky way to achieve the same results is (works with almost any
> Lucene version):
> BooleanQuery consisting of: MatchAllDocsQuery() as MUST clause and
> PrefixQuery(field, "") as MUST_NOT clause. But the PrefixQuery will do a
> full term index scan without caching :-). You may use
CachingWrapperFilter
> with PrefixFilter instead.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Tim Eck [mailto:tim...@gmail.com]
>> Sent: Thursday, February 16, 2012 10:14 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: query for documents WITHOUT a field?
>>
>> Thanks for the fast response. I'll certainly have a look at the 
>> upcoming

> 3.6.x
>> release. What is the expected performance for using a negated filter?
>> In particular does it defeat the index in any way and require a full
index
> scan?
>> Is it different between regular fields and numeric fields?
>>
>> For 3.5 and earlier though, is there any suggestion other than magic
> values?
>>
>> -Original Message-
>> From: Uwe Schindler [mailto:u...@thetaphi.de]
>> Sent: Thursday, February 16, 2012 1:07 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: query for documents WITHOUT a field?
>>
>> Lucene 3.6 will have a FieldValueFilter that can be negated:
>>
>> Query q = new ConstantScoreQuery(new FieldValueFilter("field", true));
>>
>> (see http://goo.gl/wyjxn)
>>
>> Lucen 3.5 does not yet have it, you can download 3.6 snapshots from
> Jenkins:
>> http://goo.gl/Ka0gr
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -Original Message-
>> > From: Tim Eck [mailto:t...@terracottatech.com]
>> > Sent: Thursday, February 16, 2012 9:59 PM
>> > To: java-user@lucene.apache.org
>> > Subject: query for documents WITHOUT a field?
>> >
>> > My apologies if this answer is readily available someplace, I've
>> > searched around

Re: query for documents WITHOUT a field?

2012-10-25 Thread Jack Krupansky
Right: another level of BooleanQuery as a SHOULD clause, with two clauses: a 
MUST of MatchAllDocsQuery and a MUST_NOT of the TermRangeQuery for 
"allergies" with null for both start and end.


Actually, there is a new filter that you can use to detect empty fields down 
at that level. See

https://issues.apache.org/jira/browse/LUCENE-4386

I think it is:

new ConstantScoreQuery(new FieldValueFilter(fieldname, true))

Use a SHOULD of that rather than a second level of BooleanQuery. Let us know 
if it actually works!
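
Putting it together, something like this (Lucene 4.0 API assumed, field names from the example):

BooleanQuery q = new BooleanQuery();
// first_name = 'Zed'
q.add(new TermQuery(new Term("first_name", "Zed")), BooleanClause.Occur.SHOULD);
// allergies IS NULL: negate=true matches documents with no value in the field
q.add(new ConstantScoreQuery(new FieldValueFilter("allergies", true)),
      BooleanClause.Occur.SHOULD);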


-- Jack Krupansky

-Original Message- 
From: Vitaly Funstein

Sent: Thursday, October 25, 2012 8:55 PM
To: java-user@lucene.apache.org
Subject: Re: query for documents WITHOUT a field?

This is the QueryParser syntax, right? So an API equivalent for the not
null case would be something like this?

BooleanQuery q = new BooleanQuery();
q.add(new BooleanClause(new TermQuery(new Term("first_name", "Zed")),
Occur.SHOULD));
q.add(new BooleanClause(new TermRangeQuery("allergies", null, null, true,
true), Occur.SHOULD));

Whereas, for "IS NULL" the TermRangeQuery above would need to be wrapped in
another BooleanClause with Occur.MUST_NOT?

On Thu, Oct 25, 2012 at 5:29 PM, Jack Krupansky 
wrote:



"OR allergies IS NULL" would be "OR (*:* -allergies:[* TO *])" in
Lucene/Solr.

-- Jack Krupansky

-Original Message- From: Vitaly Funstein
Sent: Thursday, October 25, 2012 8:25 PM
To: java-user@lucene.apache.org
Subject: Re: query for documents WITHOUT a field?


Sorry for resurrecting an old thread, but how would one go about writing a
Lucene query similar to this?

SELECT * FROM patient WHERE first_name = 'Zed' OR allergies IS NULL

An AND case would be easy since one would just use a simple TermQuery with
a FieldValueFilter added, but what about other boolean cases? Admittedly,
this is a contrived example, but the point here is that it seems that 
since

filters are always applied to results after they are returned, how would
one go about making the null-ness of a field part of the query logic?

On Thu, Feb 16, 2012 at 1:45 PM, Uwe Schindler  wrote:

 I already mentioned that pseudo NULL term, but the user asked for another

solution...
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de



Jamie Johnson  schrieb:

Another possible solution is while indexing insert a custom token
which is impossible to show up in the index otherwise, then do the
filter based on that token.


On Thu, Feb 16, 2012 at 4:41 PM, Uwe Schindler  wrote:
> As the documentation states:
> Lucene is an inverted index that does not have per-document fields. It
only
> knows terms pointing to documents. The query you are searching is a >
query
> that returns all documents which have no term. To execute this query, 
> it

> will get the term index and iterate all terms of a field, mark those in
> a
> bitset and negates that. The filter/query I told you uses the 
> FieldCache

to
> do this. Since 3.6 (also in 3.5, but there it is buggy/API different)
there
> is another fieldcache that returns exactly that bitset. The filter
mentioned
> only uses that bitset from this new fieldcache. Fieldcache is populated
on
> first access and keeps alive as long as the underlying index segment is
open
> (means as long as IndexReader is open and the parts of the index is not
> refreshed). If you are also sorting against your fields or doing other
> queries using FieldCache, there is no overhead, otherwise the bitset is
> populated on first access to the filter.
>
> Lucene 3.5 has no easy way to implement that filter, a "NULL" pseudo
term is
> the only solution (and also much faster on the first access in Lucene
3.6).
> Later accesses hitting the cache in 3.6 will be faster, of course.
>
> Another hacky way to achieve the same results is (works with almost any
> Lucene version):
> BooleanQuery consisting of: MatchAllDocsQuery() as MUST clause and
> PrefixQuery(field, "") as MUST_NOT clause. But the PrefixQuery will do 
> a

> full term index scan without caching :-). You may use
CachingWrapperFilter
> with PrefixFilter instead.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Tim Eck [mailto:tim...@gmail.com]
>> Sent: Thursday, February 16, 2012 10:14 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: query for documents WITHOUT a field?
>>
>> Thanks for the fast response. I'll certainly have a look at the >>
upcoming
> 3.6.x
>> release. What is the expected performance for using a negated filter?
>> In particular does it defeat the index in any way and require a full
index
> scan?
>> Is it different between regul

Re: lucene 4.0 indexReader is changed

2012-10-26 Thread Jack Krupansky

How about DirectoryReader.openIfChanged?

See:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/DirectoryReader.html#openIfChanged(org.apache.lucene.index.DirectoryReader)
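
Typical usage, sketched from memory against the 4.0 API; openIfChanged returns null when nothing changed, otherwise a new reader, and the caller still closes the old one:

DirectoryReader newer = DirectoryReader.openIfChanged(currentReader);
if (newer != null) {          // null means the index has not changed
  currentReader.close();
  currentReader = newer;      // search against the fresh reader from here on
}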

-- Jack Krupansky

-Original Message- 
From: Scott Smith

Sent: Friday, October 26, 2012 7:54 PM
To: java-user@lucene.apache.org
Subject: lucene 4.0 indexReader is changed

How do I determine if the index has been modified in 4.0? The ifchanged() 
and isChanged() appear to have been removed. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Running Solr Core/ Tika on Azure

2012-10-29 Thread Jack Krupansky
SolrCell includes Tika and SolrCell is included with Solr, at least the 
standard distribution of Solr. You can stream Office and PDF docs directly 
to the extracting request handler where Tika will process them. You can also 
ask SolrCell to "extract only" and return the extracted content.


See:
http://wiki.apache.org/solr/ExtractingRequestHandler

Whether the Azure distribution is "full" Solr including Solr Cell or not, I 
cannot answer.


Note: For future reference, "Solr" questions should be asked on the 
"solr-user" mailing list.


-- Jack Krupansky

-Original Message- 
From: Aloke Ghoshal

Sent: Monday, October 29, 2012 3:22 AM
To: java-user@lucene.apache.org ; gene...@lucene.apache.org
Subject: Running Solr Core/ Tika on Azure

Hi,

Looking for feedback on running Solr Core/ Tika parsing engine on Azure.
There's one offering for Solr within Azure from Lucid works. This offering
however doesn't mention Tika.

We are looking at options to make content from files (doc, excel, pdfs,
etc.) stored within Azure storage search-able. And whether the parser could
run against our Azure store directly to index the content. The other option
could be to write a separate connector that streams in the files. Let me
know if you have experience along these lines.

Regards,
Aloke 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: using CharFilter to inject a space

2012-11-04 Thread Jack Krupansky
I still think that we're looking at an "XY Problem" here, haggling over a 
"solution" when the problem has not been clearly and fully stated.


In particular, rather than parsing straight natural language text, the data 
appears to have a structured form. Until the structure is fully defined, 
detailing a parser, especially by playing games such as "injecting spaces" 
is an exercise in futility. I mean, you MIGHT come up with a solution that 
SEEMS to work (at least for SOME cases), and MAY make you happy, but I would 
hate to see other Lucene users adopt such an approach to problem solving.


Tell us the full problem and then we can focus on legitimate "solutions".

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Sunday, November 04, 2012 8:06 AM
To: java-user
Subject: Re: using CharFilter to inject a space

Ahh, I don't know of a better way. I can imagine complex solutions
involving something akin to WordDelimiterFilter... and I can imagine that
that would be ridiculously expensive to maintain when there are really
simple solutions like you're looking at.

Mostly I was curious about your use-case

Erick


On Sat, Nov 3, 2012 at 11:35 PM, Igal @ getRailo.org 
wrote:



well, my main goal is to use a ShingleFilter that will only take shingles
that are not separated by commas etc.

for example, the phrase:

"red apples, green tomatoes, and brown potatoes"

should yield the shingles "red apples", "green tomatoes", "and brown",
"brown potatoes"; but not "apples green" and not "tomatoes and" as those
are separated by commas.

the problem with the common tokenizers is that they get rid of the commas
so if I use a ShingleFilter after them there's no way to tell if there was
a comma there or not.

(another option I consider is to add an Attribute to specify if there was
a comma before or after a token)

if there's a better way -- I'm open to suggestions,


Igal



On 11/3/2012 8:10 PM, Erick Erickson wrote:


So I've gotta ask... _why_ do you want to inject the spaces?
If it's just to break this up into tokens,  wouldn't something like
LetterTokenizer do? Assuming you aren't interested in
leaving in numbers Or even StandardTokenizer unless you have
e-mail & etc.

Or what about PatternReplaceCharFilter?

FWIW,
Erick



On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir  wrote:

 You're right.  I'm not sure what I was thinking.


Thanks for all your help,

Igal
  On Nov 3, 2012 5:44 PM, "Robert Muir"  wrote:

 On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org 

wrote:


hi Robert,

thank you for your replies.

I couldn't find much documentation/examples of this, but this is what 
I



came


up with (below).  is that the way I'm supposed to use the


MappingCharFilter?
You don't need to extend anything.
You also don't want to create a normalizecharmap for each reader
(thats way too heavy)

Just build the NormalizeCharMap once, and pass it to
MappingCharFilter's Constructor.
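
For example, a sketch along those lines (3.x/4.0-style API -- newer 4.x releases
build the map with a NormalizeCharMap.Builder instead of add(); the comma mapping
is purely illustrative):

// build the map once, e.g. in a static field -- not once per Reader
static final NormalizeCharMap CHAR_MAP = new NormalizeCharMap();
static {
    CHAR_MAP.add(",", " , ");
}

// then, in Analyzer#initReader (or wherever you wrap the Reader), reuse it:
Reader wrapped = new MappingCharFilter(CHAR_MAP, reader);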

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: content disappears in the index

2012-11-12 Thread Jack Krupansky
Maybe... the author names have middle or first initials? Like, maybe the 
"Arslanagic" dude has an "A" initial in his name, like "A. Arslanagic" or 
"Arslanagic, A.".


In any case, "string" is the proper type for a sorted field, although it 
would be nice if Lucene/Solr was more developer-friendly when this "mistake" 
is made.


The relevant doc is:

"Sorting can be done on the "score" of the document, or on any 
multiValued="false" indexed="true" field provided that field is either 
non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a 
single Term (ie: uses the KeywordTokenizer)"

...
"The common situation for sorting on a field that you do want to be 
tokenized for searching is to use a copyField to clone your field. Sort on 
one, search on the other."


See:
http://wiki.apache.org/solr/CommonQueryParameters

For example, have an "author" field that is "text" and an "author_s" (or 
"author_sorted" or "author_string") field that you copy the name to:


   

Query on "author", but sort on "author_s".

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Monday, November 12, 2012 5:28 AM
To: java-user
Subject: Re: content disappears in the index

First, sorting on tokenized fields is undefined/unsupported. You _might_
get away with it if the author field always reduces to one token, i.e. if
you're always indexing only the last name.

I should say unsupported/undefined when more than one token is the result
of analysis. You can do things like use the KeywordTokenizer followed by
tranformations on the _entire_ input field (lowercasing is popular for
instance).

So somehow the analysis chain you have defined for this field grabs
"Arslanagic"
and translates it into "a". Synonyms? Stemming? Some "interesting" sequence?

The fastest way to look at that would be in Solr's admin/analysis page.
Just put Arslanagic into the index box and you should see which of the
steps does the translation. Although changing it to "a" is really weird,
it's almost certainly something you've defined in the indexing analysis
chain.

FWIW,
Erick


On Mon, Nov 12, 2012 at 8:19 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:


Hi list,
a user reported wrong sorting of our search service running on solr.
While chasing this issue I traced it back through lucene into the index.
I have a text field for sorting
(stored,indexed,tokenized,omitNorms,sortMissingLast)
and three docs with author names.

If I trace at org.apache.lucene.document.Document.add(IndexableField) 
while
indexing I can see all three author names added as field to each 
documents.


After searching with *:* for the three docs and doing a sort the sorting
is wrong
because one of the author names is reduced to the first char, all other
chars are lost.

So having the authors names (Alexander, Arslanagic, Brennmoen) indexed,
the result
of sorting ascending is (Arslanagic, Alexander, Brennmoen) which is wrong.
But this happens because the author "Arslanagic" is reduced to "a" during
indexing (???)
and if sorted "a" is before "alexander".

Currently I use 4.0 but have the same issue with 3.6.1.

Without tracing through tons of code:
- which is the last breakpoint for debugging to see the docs right before
they go into the index
- which is the first breakpoint for debugging to see the docs coming right
out of the index

Regards
Bernd

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which stemmer?

2012-11-14 Thread Jack Krupansky
What is your use case? If you don't have a specific use case in mind, try 
each of them with some common words that you expect will or won't be 
stemmed. If you have Solr, you can experiment interactively using the Solr 
Admin Analysis web page.


It would be nice if the javadoc for each stemmer gave a handful of examples 
that illustrated how some common words are stemmed.


-- Jack Krupansky

-Original Message- 
From: Scott Smith

Sent: Wednesday, November 14, 2012 10:55 AM
To: java-user@lucene.apache.org
Subject: Which stemmer?

Does anyone have any experience with the stemmers?  I know that Porter is 
what "everyone" uses.  Am I better off with KStemFilter (better performance) 
or ??  Does anyone understand the differences between the various stemmers 
and how to choose one over another? 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which stemmer?

2012-11-14 Thread Jack Krupansky
Another word set to try: invest, investing, investment, investments, 
invests, investor, invester, investors, investers.


Also, take a look at EnglishMinimalStemmer (EnglishMinimalStemFilterFactory) 
for minimal stemming.


See:
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemFilterFactory.html
http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemmer.html

-- Jack Krupansky

-Original Message- 
From: Scott Smith

Sent: Wednesday, November 14, 2012 5:17 PM
To: java-user@lucene.apache.org
Subject: RE: Which stemmer?

Unfortunately, my "use case" is a customer who wants stemming, but has very 
little knowledge of what that means except they think they want it.


I agree with your last comment.  So, here's my contribution:

    Original        porter         kstem       minStem
------------  ------------  ------------  ------------
     country       countri       country       country
         run           run           run           run
        runs           run          runs           run
     running           run       running       running
        read          read          read          read
     reading          read       reading       reading
      reader        reader        reader        reader
 association        associ   association   association
   associate        associ     associate     associate
     listing          list          list       listing
       water         water         water         water
     watered         water         water       watered
        sure          sure          sure          sure
      surely          sure        surely        surely
      fred's         fred'        fred's         fred'
       roses          rose          rose          rose

Still not sure which one to pick.  Porter is more aggressive.  Min stemmer 
is pretty minimal.  Perhaps the kstemmer is "just right" :-)


Cheers

Scott

-----Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, November 14, 2012 4:14 PM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

What is your use case? If you don't have a specific use case in mind, try 
each of them with some common words that you expect will or won't be 
stemmed. If you have Solr, you can experiment interactively using the Solr 
Admin Analysis web page.


It would be nice if the javadoc for each stemmer gave a handful of examples 
that illustrated how some common words are stemmed.


-- Jack Krupansky

-Original Message-
From: Scott Smith
Sent: Wednesday, November 14, 2012 10:55 AM
To: java-user@lucene.apache.org
Subject: Which stemmer?

Does anyone have any experience with the stemmers?  I know that Porter is 
what "everyone" uses.  Am I better off with KStemFilter (better performance) 
or ??  Does anyone understand the differences between the various stemmers 
and how to choose one over another?



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which stemmer?

2012-11-15 Thread Jack Krupansky
One other factor to keep in mind is that the customer should never "look" at 
the actual stem term - such as "countri" or "gener" because it can freak 
them out a little, for no good reason. I mean, the goal of stemming is to 
show what set of words/terms will be treated as equivalent on a query, and 
this is independent of what gets returned for a stored field. The stem is 
simply the means to THAT end.


The fact that "dog" and "dogs" are not equivalent in KStem is in fact 
disheartening, at least to me, but it may not be problematic in some use 
cases.


-- Jack Krupansky

-Original Message- 
From: Scott Smith

Sent: Thursday, November 15, 2012 11:57 AM
To: java-user@lucene.apache.org
Subject: RE: Which stemmer?

Thanks for the suggestions I think Erick is correct as well.  I'll let the 
customer decide.


Here's an updated list.  Fyi--the minStem was the English Minimal Stemmer--I 
changed the label.  Interesting to see where the minimal stemmer and porter 
agree (and KStemmer doesn't).  You may also find the "dog" examples 
interesting.  I also found the "invest*" list entertaining.


     original         porter          kstem     EngMinStem
-------------  -------------  -------------  -------------
      country        countri        country        country
    countries        countri        country        country
    country's       country'      country's       country'
          run            run            run            run
         runs            run           runs            run
      running            run        running        running
         read           read           read           read
      reading           read        reading        reading
       reader         reader         reader         reader
  association         associ    association    association
    associate         associ      associate      associate
      listing           list           list        listing
        water          water          water          water
      watered          water          water        watered
         sure           sure           sure           sure
       surely           sure         surely         surely
       invest         invest         invest         invest
    investing         invest         invest      investing
   investment         invest     investment     investment
  investments         invest     investment     investment
      invests         invest         invest        invests
     investor       investor         invest       investor
     invester         invest         invest       invester
    investors       investor         invest       investor
    investers         invest         invest       invester
 organization          organ   organization   organization
     organize          organ       organize       organize
      organic          organ        organic        organic
     generous          gener       generous       generous
      generic          gener        generic        generic
          dog            dog            dog            dog
        dog's           dog'          dog's           dog'
         dogs            dog           dogs            dog
        dogs'            dog           dogs            dog

Now, if someone would answer my question on the Solr list ("Custom Solr 
Indexer/Search"), my day would be complete ;-).


Thanks for the continued help.

Scott

-Original Message-
From: Tom Burton-West [mailto:tburt...@umich.edu]
Sent: Thursday, November 15, 2012 11:06 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I agree with Erick that you probably need to give your client a list of 
concrete examples, and perhaps to explain the trade-offs.


All stemmers both overstem and understem.   Understemming means that some
forms of a word won't get searched.  For example, without stemming, 
searching for "dogs" would not retrieve documents containing the word "dog".
Generally there is a precision/recall tradeoff where reducing understemming 
increases overstemming.  The problem with aggressive stemmers like the 
Porter stemmer, is that they overstem.


The original Porter stemmer, for example, would stem "organization" and
"organic" both to "organ", and "generalization", "generous" and "generic" to
"gener".


For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer 
understems or overstems and explains the logic of Kstem: "Viewing Morphology 
as an Inference Process"  (*Krovetz*, R., Proceedings of the Sixteenth 
Annual International ACM SIGIR Conference on Research and Development in 
Information Retrieval, 191-203, 1993).


http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Re: how do re-get the doc after the doc was indexed ?

2012-11-17 Thread Jack Krupansky
Sounds like you want path to be a unique key field. So, just do a Lucene 
search with a TermQuery for the path, which should return one document. No 
need to mess with Lucene internal doc ids.
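
For example, a sketch (assumes the path field was indexed as a single un-analyzed
term, and reuses the path from the question):

IndexSearcher searcher = new IndexSearcher(reader);
TopDocs hits = searcher.search(new TermQuery(new Term("path", "c:/books/appach.txt")), 1);
if (hits.totalHits > 0) {
    Document doc = searcher.doc(hits.scoreDocs[0].doc);
    // read the stored fields, change author to "John", rebuild the document ...
}

// IndexWriter#updateDocument then deletes the old document by term and adds
// the new one in a single call:
writer.updateDocument(new Term("path", "c:/books/appach.txt"), newDoc);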


-- Jack Krupansky

-Original Message- 
From: wgggfiy

Sent: Saturday, November 17, 2012 8:08 AM
To: java-user@lucene.apache.org
Subject: how do re-get the doc after the doc was indexed ?

for example:
I indexed a doc with path=c:/books/appach.txt, author=Mike
After a long time, I wanted to modify the author to John.
But the quethion is how I can get the exact same doc fastly ??
My idea is to traverse the docs from id=0 to id=maxDoc(), and
retrive it with store fields, and check its path whether equals
path=c:/books/appach.txt.
Any better idea ??
thx




--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-do-re-get-the-doc-after-the-doc-was-indexed-tp4020865.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about ordering rule of SpanNearQuery

2012-11-19 Thread Jack Krupansky
Unfortunately, there doesn't appear to be any Javadoc that discusses what 
factors are used to score spans. For example, how to relate the number of 
times a span matches in a document vs. the exactness of each span match.


-- Jack Krupansky

-Original Message- 
From: 杨光

Sent: Monday, November 19, 2012 5:02 AM
To: java-user@lucene.apache.org
Subject: Question about ordering rule of SpanNearQuery

Hi all,
   Recently, we are developing a platform with lucene. The ordering rule we 
specified is the document with the shortest distance between query terms 
ranks the first. But there may be a little different with SpanNearQuery. It 
returns all the documents with qualified distance. So I am confused with the 
ordering rule about SpanNearQuery. For example, I indicate the slop in 
SpanNearQuery is 10. And the results are all the qualified documents. Is it 
true that any document with a shorter distance between query terms ranks before the 
one with a longer distance, without considering the tf-idf algorithm? Or among 
all the qualified documents, it still uses the tf-idf algorithm to rank the docs. 
Or there is some complex algorithm blending the distance and tf-idf 
algorithm.





Thanks in advance.
Kobe






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Line feed on windows

2012-11-20 Thread Jack Krupansky
This doesn't sound like a Lucene issue. It's up to you to read a file and 
pass it as a string to Lucene. Maybe you're trying to read the file one line 
at a time, in which case it is up to you to supply line delimiters when 
combining the lines into a single string. Try reading the full file into a 
single string, line delimiters and all. Be careful about encoding though.
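
For example, something along these lines keeps the line feeds intact (a sketch;
assumes Java 7 NIO, UTF-8 files, and a "contents" field of your choosing):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// read the entire file, "\n" characters included, instead of line by line
String contents = new String(Files.readAllBytes(Paths.get("/path/to/file.txt")),
                             StandardCharsets.UTF_8);
doc.add(new TextField("contents", contents, Field.Store.YES));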


-- Jack Krupansky

-Original Message- 
From: Mansour Al Akeel

Sent: Tuesday, November 20, 2012 1:19 PM
To: java-user
Subject: Line feed on windows

Hello all,
We are indexing and storing files contents in lucene index. These
files contains line feed "\n" as end of line character. Lucene is
storing the content as is,
however when we read them, the "\n" is removed and we end up with text
that is concatenated when there's no space.

I can re-read the files from the filesystem to avoid this, but I like
to see if there is other alternatives.

Thank you.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about ordering rule of SpanNearQuery

2012-11-21 Thread Jack Krupansky
Add &debugQuery=true to your query and look at the "explain" section to see 
how the scoring is calculated for each document. Sometimes it is 
counter-intuitive and some factors may differ but those differences can be 
overwhelmed by other, unrelated factors.


-- Jack Krupansky

-Original Message- 
From: 杨光

Sent: Wednesday, November 21, 2012 10:26 AM
To: java-user@lucene.apache.org
Subject: Question about ordering rule of SpanNearQuery

Hi all,
   Recently, we are developing a platform with lucene. The ordering rule we 
specified is the document with the shortest distance between query terms 
ranks the first. But there may be a little different with SpanNearQuery. It 
returns all the documents with qualified distance. So I am confused with the 
ordering rule about SpanNearQuery. For example, I indicate the slop in 
SpanNearQuery is 10. And the results are all the qualified documents. Is it 
true that any document with a shorter distance between query terms ranks before the 
one with a longer distance, without considering the tf-idf algorithm? Or among 
all the qualified documents, it still uses the tf-idf algorithm to rank the docs. 
Or there is some complex algorithm blending the distance and tf-idf 
algorithm.

   Thanks in advance.



--

Guang Yang,
Dept. of Computer Science
Peking University, 100080
Beijing, China
Tel: +86 18631516893 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about ordering rule of SpanNearQuery

2012-11-21 Thread Jack Krupansky
Oops... sorry, I just noticed that you are a Lucene, not Solr, user. Call 
the IndexSearcher#explain method to get the explanation and call the 
toString method on the explanation to see the readable text.


http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, 
int)
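
For example (a sketch; "query" is whatever you searched with):

TopDocs hits = searcher.search(query, 10);
for (ScoreDoc hit : hits.scoreDocs) {
    Explanation explanation = searcher.explain(query, hit.doc);
    System.out.println(explanation.toString());  // human-readable scoring breakdown
}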


-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Wednesday, November 21, 2012 11:44 AM
To: java-user@lucene.apache.org
Subject: Re: Question about ordering rule of SpanNearQuery

Add &debugQuery=true to your query and look at the "explain" section to see
how the scoring is calculated for each document. Sometimes it is
counter-intuitive and some factors may differ but those differences can be
overwhelmed by other, unrelated factors.

-- Jack Krupansky

-Original Message- 
From: 杨光

Sent: Wednesday, November 21, 2012 10:26 AM
To: java-user@lucene.apache.org
Subject: Question about ordering rule of SpanNearQuery

Hi all,
   Recently, we are developing a platform with lucene. The ordering rule we
specified is the document with the shortest distance between query terms
ranks the first. But there may be a little different with SpanNearQuery. It
returns all the documents with qualified distance. So I am confused with the
ordering rule about SpanNearQuery. For example, I indicate the slop in
SpanNearQuery is 10. And the results are all the qualified documents. Is it
true that any document with a shorter distance between query terms ranks before the
one with a longer distance, without considering the tf-idf algorithm? Or among
all the qualified documents, it still uses the tf-idf algorithm to rank the docs.
Or there is some complex algorithm blending the distance and tf-idf
algorithm.
   Thanks in advance.



--

Guang Yang,
Dept. of Computer Science
Peking University, 100080
Beijing, China
Tel: +86 18631516893


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which stemmer?

2012-11-21 Thread Jack Krupansky

Great! For my favorite example of "invest", "invests", etc. it shows:

SnowballEnglish:
•investment
•invest
•invests
•investing
•invested

kStem:
•investors
•invest
•investor
•invests
•investing
•invested

minimalStem:
•invest
•invests

That highlights the distinctions between these stemmers quite well, without 
highlighting the actual indexed term, which can be quite ugly.


-- Jack Krupansky

-Original Message- 
From: Elmer van Chastelet

Sent: Wednesday, November 21, 2012 8:49 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I've just created a small web application which you might find useful.
You can see which words are matched by a query word when using different
analyzers  (phonetic and stemming analyzers).
These include snowball, kstem and minimal stem (the ones on the right).

http://dutieq.st.ewi.tudelft.nl/wordsearch/

I can extend the app with more analyzers. Please let me know :)

--Elmer

Example

On 11/14/2012 07:55 PM, Scott Smith wrote:
Does anyone have any experience with the stemmers?  I know that Porter is 
what "everyone" uses.  Am I better off with KStemFilter (better 
performance) or ??  Does anyone understand the differences between the 
various stemmers and how to choose one over another?





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How does lucene handle the wildcard and fuzzy queries ?

2012-11-27 Thread Jack Krupansky
The proper answer to all of these questions is the same and very simple: If 
you want "internal" details, read the source code first. If you have 
specific questions then, fine, ask specific questions - but only after 
you've checked the code first.


Also, questions or issues related to "internals" aren't appropriate on 
"user" lists.


-- Jack Krupansky

-Original Message- 
From: sri krishna

Sent: Tuesday, November 27, 2012 12:36 PM
To: java-user@lucene.apache.org
Subject: How does lucene handle the wildcard and fuzzy queries ?

How does lucene handle the prefix queries(wild card) and fuzzy queries
internally?

Lucene stores date in in for of inverted index in segments, i.e term->doc
id's. How does it search a word in the term list efficiently? And how does
it handle the adv queries on same the inverted index?


Thanks 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: handling different scores related to queries

2012-11-27 Thread Jack Krupansky
Call the IndexSearch#explain method to get the technical details on how any 
query is scored. Call Explanation#toString to get the English description 
for the scoring.


Or, using Solr, add the &debugQuery=true parameter to your query request and 
look at the "explain" section for scoring calculations.


Some of these complex queries are "constant score" for performance reasons.

-- Jack Krupansky

-Original Message- 
From: sri krishna

Sent: Tuesday, November 27, 2012 12:38 PM
To: java-user
Subject: handling different scores related to queries

for a search string hello*~ how the scoring is calculated?

as the formula given in the url:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/search/Similarity.html,
doesn't take into consideration of edit distance(levenshtein distance) and
prefix term corresponding factors into account.

Does lucene add up the scores obtained from each type of query included i.e
for the above query actual score=default scoring+1/(edit distance)+prefix
match score ?, If so, there is no normalization between scores, else what
is the approach lucene follows starting from seperating each query based
identifiers like (~(edit distance), *(prefix query) etc) to actual scoring. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Jack Krupansky
"I will probably have to implement my own datastructure and 
parser/tokenizer/stemmer"


Why? I mean, I think the point of the Lucene architecture is that the codec 
level is completely independent of the analysis level.


The end result of analysis is a value to be stored from the application 
perspective, a "logical value" so to speak, but NOT the bit sequence, the 
"physical value" so to speak, that the codec will actually store.


So, go ahead and have your own codec that does whatever it wants with 
values, but the input for storage and query should be the output of a 
standard Lucene analyzer.


-- Jack Krupansky

-Original Message- 
From: Johannes.Lichtenberger

Sent: Friday, November 30, 2012 10:15 AM
To: java-user@lucene.apache.org
Cc: Michael McCandless
Subject: Re: What is "flexible indexing" in Lucene 4.0 if it's not the 
ability to make new postings codecs?


On 11/28/2012 01:11 AM, Michael McCandless wrote:

Flexible indexing is the ability to make your own codec, which
controls the reading and writing of all index parts (postings, stored
fields, term vectors, deleted docs, etc.).

So for example if you want to store some postings as a bit set instead
of the block format that's the default coming up in 4.1, that's easy
to do.

But what is less easy (as I described below) is changing what is
actually stored in the postings, eg adding a new per-position
attribute.

The original goal was to allow arbitrary attributes beyond the known
docs/freqs/positions/offsets that Lucene supports today, so that you
could easily make new application-dependent per-term, per-doc,
per-position things, pull them from the analyzer, save them to the
index, and access them from an IndexReader / query, but while some
APIs do expose this, it's not very well explored yet (eg, you'd have
to make a custom indexing chain to get the attributes "through"
IndexWriter down to your codec).  It would be great to make progress
making this easier, so ideas are very welcome :)


Regarding my questin/thread, is it also possible to change the backend
system? I'd like to use Lucene for a versioned DBMS, thus I would need
the ability to serialize/deserialize the bytes in my backend whereas
keys/values are stored in pages (for instance in an upcoming B+-tree, or
in  simple "unordered" pages via a record-ID/record mapping). But as no
one suggested anything as of now and I've also asked a year ago or so,
after implementing the B+-tree I will probably have to implement my own
datastructure and parser/tokenizer/stemmer... :-(

kind regards,
Johannes


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using alternative scoring mechanism.

2012-12-02 Thread Jack Krupansky
I thought it was that simple too, but I couldn't find the 
"get/setSimilarityProvider" methods listed in that patch, and no mention in 
the Similarity class. Obviously this feature has morphed a bit since then. 
To cut to the chase, you still use the old methods for setting the 
similarity class (IndexSearcher#setSimilarity), but now you need to 
instantiate a PerFieldSimilarityWrapper that provides a "get" method for 
each field. It is mentioned in the Similarity javadoc, if you read it 
carefully enough.


See:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/PerFieldSimilarityWrapper.html

This appears to be the change from that original design:
https://issues.apache.org/jira/browse/LUCENE-3749
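
A minimal sketch of that arrangement (the field name and NameSimilarity are
placeholders -- the custom Similarity itself still has to be written):

Similarity perField = new PerFieldSimilarityWrapper() {
    private final Similarity defaultSim = new DefaultSimilarity();
    private final Similarity nameSim = new NameSimilarity();  // hypothetical custom Similarity

    @Override
    public Similarity get(String fieldName) {
        return "name".equals(fieldName) ? nameSim : defaultSim;
    }
};
searcher.setSimilarity(perField);

The same wrapper can also be passed to IndexWriterConfig#setSimilarity so that
norms are computed consistently at index time.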

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Sunday, December 02, 2012 9:54 AM
To: java-user
Subject: Re: Using alternative scoring mechanism.

I think you're looking for per-field similiarity, does this help?
https://issues.apache.org/jira/browse/LUCENE-2236

Note, in 4.0 only

Best
Erick


On Sat, Dec 1, 2012 at 1:43 PM, Eyal Ben Meir  wrote:


Can one replace the basic scoring algorithm (TF/IDF) for a specific field,
to use a different one?

I need to compute similarity for NAME field.  The regular TF/IDF is not
good
enough, and I want to use a Name Recognition Engine as a scorer.

How can it be done?

Thanks,  Eyal.





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is Analyzer used when calling IndexWriter.addIndexesNoOptimize()?

2012-12-05 Thread Jack Krupansky
These are operations on indexes, so analysis is no longer relevant. Analysis 
is performed BEFORE data is placed in an index. You still need to perform 
analysis for queries though.


You can use different analyzers as long as they are compatible - if they 
produce the same results, but if they don't produce the same token stream at 
both the token and character/byte level, your queries may fail.


Rule #1 with Lucene and Solr - always be prepared to completely reindex your 
data, precisely because ideas about analysis evolve over time.


-- Jack Krupansky

-Original Message- 
From: Earl Hood

Sent: Wednesday, December 05, 2012 2:33 AM
To: java-user@lucene.apache.org
Subject: Is Analyzer used when calling IndexWriter.addIndexesNoOptimize()?

Lucene version: 3.0.3

Does IndexWriter use the analyzer when adding indexes via
addIndexesNoOptimize()?

What about for optimize()?

I am examining some existing code and trying to determine
what effects there may be when combining multiple indexes
into a single index, but each index may have had different
analyzers used.

Thanks,

--ewh

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boolean and SpanQuery: different results

2012-12-13 Thread Jack Krupansky
Can you provide some examples of terms that don't work and the index token 
stream they fail on?


Make sure that the Analyzer you are using doesn't do any magic on the 
indexed terms - your query term is unanalyzed. Maybe multiple, but distinct, 
index terms are analyzing to the same, but unexpected term.


-- Jack Krupansky

-Original Message- 
From: Carsten Schnober

Sent: Thursday, December 13, 2012 10:49 AM
To: java-user@lucene.apache.org
Subject: Boolean and SpanQuery: different results

Hi,
I'm following Grant's advice on how to combine BooleanQuery and
SpanQuery
(http://mail-archives.apache.org/mod_mbox/lucene-java-user/201003.mbox/%3c08c90e81-1c33-487a-9e7d-2f05b2779...@apache.org%3E).

The strategy is to perform a BooleanQuery, get the document ID set and
perform a SpanQuery restricted by those documents. The purpose is that I
need to retrieve Spans for different terms in order to extract their
respective payloads separately, but a precondition is that possibly
multiple terms occur within the documents. My code looks like this:

/* reader and terms are class variables and have been declared final
before */
IndexReader reader = ...;
List<String> terms = ...;

/* perform the BooleanQuery and store the document IDs in a BitSet */
BitSet bits = new BitSet(reader.maxDoc());
AllDocCollector collector = new AllDocCollector();
BooleanQuery bq = new BooleanQuery();
for (String term : terms)
  bq.add(new org.apache.lucene.search.RegexpQuery(new
Term(config.getFieldname(), term)), Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(bq, collector);
for (ScoreDoc doc : collector.getHits())
  bits.set(doc.doc);

/* get the spans for each term separately */
for (String term : terms) {
  String payloads = retrieveSpans(term, bits);
  // process and print payloads for term ...
}

private String retrieveSpans(String term, BitSet bits) {
  StringBuilder payloads = new StringBuilder();
  Map<Term, TermContext> termContexts = new HashMap<>();
  Spans spans;
  SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(new
      RegexpQuery(new Term("text", term))).rewrite(reader);

  for (AtomicReaderContext atomic : reader.leaves()) {
    spans = sq.getSpans(atomic, new DocIdBitSet(bits), termContexts);
    while (spans.next()) {
      // extract and store payloads in the 'payloads' StringBuilder
    }
  }
  return payloads.toString();
}


This construction seemed to be working fine at first, but I noticed a
disturbing behaviour: for many terms, the BooleanQuery when fed with one
RegexpQuery only matches a larger number of documents than the SpanQuery
constructed from the same RegexpQuery.
With the BooleanQuery containing only one RegexpQuery, the number should
be identical, while with multiple Queries added to the BooleanQuery, the
SpanQuery should return an equal number or more results. This behaviour
is reproducible reliably even after re-indexing, but not for all tokens.
Does anyone have an explanation for that?

Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



precisionStep for days in TrieDate

2012-12-14 Thread Jack Krupansky
If I specify a precisionStep of 26 for a TrieDate field, what rough impact 
should this have on both performance and index size?

The input data has time in it, but the milliseconds per day is not needed for 
the app. Will Lucene store only the top 64 minus 26 bits of data and discard 
the low 26 bits?

I’ve read that a higher precisionStep will lower performance. Will a 
precisionStep of 26 have dramatically lower performance when referencing days 
(without time of day)?

I suppose that the piece of information I am missing is whether trie 
precisionStep simply affects some extra index table that trie keeps beyond the 
raw data values or the data values themselves.

-- Jack Krupansky

Re: precisionStep for days in TrieDate

2012-12-14 Thread Jack Krupansky
Thanks, you answered the main question - 26 doesn't simply lop off the time 
of day. Although, I still don't completely follow how trie works (without 
reading the paper itself.)


-- Jack Krupansky

-Original Message- 
From: Uwe Schindler

Sent: Friday, December 14, 2012 5:58 PM
To: java-user@lucene.apache.org
Subject: RE: precisionStep for days in TrieDate

Hi,


If I specify a precisionStep of 26 for a TrieDate field, what rough impact
should this have on both performance and index size?


This value is mostly useless, everything > 8 slows the queries down to the 
speed of TermRangeQuery.


The input data has time in it, but the milliseconds per day is not needed 
for
the app. Will Lucene store only the top 64 minus 26 bits of data and 
discard

the low 26 bits?


No, you may need to read the Javadocs of NumericRangeQuery, now updated with 
formulas: http://goo.gl/nyXQR
The precisionStep is a count, after how many bits of the indexed value a new 
term starts. The original value is always indexed in full precision. 
Precision step of 4 for a 32 bit value (integer) means terms with these bit 
counts:
All 32, left 28, left 24, left 20, left 16, left 12, left 8, left 4 bits of 
the value (total 8 terms/value). A precision step of 26 would index 2 terms: 
all 32 bits and one single term with the remaining 6 bits from the left.
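
To make that concrete for the TrieDate case (a 64-bit long): precisionStep=26
produces ceil(64/26) = 3 terms per value (the full 64 bits, the leftmost 38 bits,
and the leftmost 12 bits), versus 16 terms with the default step of 4. A sketch
of where the step is set and used (Lucene 4.0 numeric APIs; the field name is
illustrative):

// index time: a long date value with a non-default precisionStep
FieldType dateType = new FieldType(LongField.TYPE_NOT_STORED);
dateType.setNumericPrecisionStep(26);
doc.add(new LongField("date", dateMillis, dateType));

// query time: the same precisionStep must be passed to the range query
Query q = NumericRangeQuery.newLongRange("date", 26, minMillis, maxMillis, true, true);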



I’ve read that a higher precisionStep will lower performance. Will a
precisionStep of 26 have dramatically lower performance when referencing
days (without time of day)?


See above. The assumption that 26 will limit precision to days is wrong.


I suppose that the piece of information I am missing is whether trie
precisionStep simply affects some extra index table that trie keeps beyond
the raw data values or the data values themselves.


It only affects how the value is indexed (how many terms), but not the 
value.


Uwe


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Help needed: search is returning no results

2012-12-18 Thread Jack Krupansky
Maybe you wanted "text" fields that are analyzed and tokenized, as opposed 
to string fields which are not analyzed and stored and queried exactly 
as-is.


See:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/document/TextField.html

But, show us some of your indexed data and queries that fail.
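
For instance, in the indexing code quoted below the change would amount to
something like this (a sketch; only the field class changes):

doc.add(new TextField("text", text.toString(), Field.Store.YES));
doc.add(new TextField("labels", labels.toString(), Field.Store.NO));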

-- Jack Krupansky

-Original Message- 
From: Ramon Casha

Sent: Tuesday, December 18, 2012 9:14 AM
To: java-user@lucene.apache.org
Subject: Help needed: search is returning no results

I have just downloaded and set up Lucene 4.0.0 to implement a search
facility for a web app I'm developing.

Creating the index seems to be successful - the files created contain
the text that I'm indexing. However, search is returning no results.
The code I'm using is fairly similar to the examples given.

Here is the search code:

   private static final File index = new File("/tmp/naturopedia-index");
   private static final Version VERSION = Version.LUCENE_40;

   public String search() throws IOException, ParseException {
   Analyzer analyzer = new StandardAnalyzer(VERSION);
   Directory directory = FSDirectory.open(index);

   DirectoryReader ireader = DirectoryReader.open(directory);
   IndexSearcher isearcher = new IndexSearcher(ireader);

   QueryParser parser = new QueryParser(VERSION, "labels", analyzer);

   Query q = parser.parse(getQuery());
   ScoreDoc[] hits = isearcher.search(q, 1000).scoreDocs;

   for (int i = 0; i < hits.length; i++) {
   Document hitDoc = isearcher.doc(hits[i].doc);
   System.out.println(hitDoc.getField("id"));
   }
   ireader.close();
   directory.close();
   return "search";
   }
--
Every time I try this, the search returns zero results. I tried with
different fields (text and labels), both of which are indexed, and I
tried different words. Any help would be appreciated.

Here is the code for producing the index:

public String crawl() throws IOException {
   Analyzer analyzer = new StandardAnalyzer(VERSION);
   Directory directory = FSDirectory.open(index);

   IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
   config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
   IndexWriter iwriter = new IndexWriter(directory, config);

   DB db = new DB(); // JPA interface class.

   for ( Taxonomy t : db.taxonomy.all() ) {
   LOG.log(Level.INFO, "Scanning {0}", t);

   Document doc = new Document();
   doc.add(new LongField("id", t.getId(), Store.YES));
   LOG.log(Level.INFO, "  id={0}", t.getId());
   StringBuilder text = new StringBuilder();
   for(Text l : t.getTexts()) {
   text.append(l.getWikiText())
   .append((' '));
   }
   if(text.length() > 0) {
   doc.add(new StringField("text", text.toString(),
Field.Store.YES));
   LOG.log(Level.INFO, "  text={0}", text);
   }

   StringBuilder labels = new StringBuilder();
   toIndex(labels, t.getLabels());
   for(Image l : t.getImages()) {
   toIndex(labels, l.getLabels());
   }
   doc.add(new StringField("labels", labels.toString(),
Field.Store.NO));
   LOG.log(Level.INFO, "  labels={0}", labels);
   StringBuilder sb = new StringBuilder();
   for ( Tag tag : t.getTags() ) {
   toIndex(sb, tag.getLabels());
   }
   if(!sb.toString().isEmpty()) {
   doc.add(new StringField("tags", sb.toString(), 
Field.Store.NO));

   LOG.log(Level.INFO, "  tags={0}", sb);
   }
   iwriter.addDocument(doc);
   }

   db.close();
   iwriter.close();

   return "search";
   }

   private void toIndex(StringBuilder sb, LabelGroup lg) {
   for(Label l : lg.getLabels()) {
   sb.append(l.getText());
   sb.append(" ");
   }
   }


--
Ramon Casha

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NGramPhraseQuery with missing terms

2012-12-19 Thread Jack Krupansky
"a BooleanQuery, but it requires me to consider every possible pair of terms 
(since any one of the terms could be missing)"


What about setting minMatch and all the terms as "SHOULD" - and then 
minMatch could be tuned for how many missing terms to tolerate?


See:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)
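
A sketch of that idea ("gramTerms" is a placeholder for the Terms produced from
the query string's n-grams):

BooleanQuery bq = new BooleanQuery();
for (Term gram : gramTerms) {
    bq.add(new TermQuery(gram), BooleanClause.Occur.SHOULD);  // each n-gram is optional...
}
bq.setMinimumNumberShouldMatch(gramTerms.size() - 1);         // ...but at most one may be missing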

-- Jack Krupansky

-Original Message- 
From: 김한규

Sent: Wednesday, December 19, 2012 2:36 AM
To: java-user@lucene.apache.org
Subject: NGramPhraseQuery with missing terms

Hi.

I am trying to make a NGramPhrase query that could tolerate terms missing,
so even if one of the NGrams doesn't match it still gets picked up by
search.
I know I could use the combination of normal SpanNearQuery and a
BooleanQuery, but it requires me to consider every possible pair of terms
(since any one of the terms could be missing) and it gets too messy and
expensive.

What I want to try is to use SpanTermQuery to get the positions of the
mathcing NGrams and list the spans' position informations in an order, so
that I could pick up any two or more spans near each other to score them
accordingly, but I can't figure out how can I combine the spans.

Any help in solving this issue is appreciated. Also, if there is an example
of a simple scoring implementation example that combines multiple queries'
results, it would be very nice. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which token filter can combine 2 terms into 1?

2012-12-21 Thread Jack Krupansky
And to be more specific, most query parsers will have already separated the 
terms and will call the analyzer with only one term at a time, so no term 
recombination is possible for those parsed terms, at query time.


-- Jack Krupansky
-Original Message- 
From: Erick Erickson

Sent: Friday, December 21, 2012 8:27 AM
To: java-user
Subject: Re: Which token filter can combine 2 terms into 1?

If it's a fixed list and not excessively long, would synonyms work?

But if theres some kind of logic you need to apply, I don't think you're
going to find anything OOB.
The problem is that by the time a token filter gets called, they are
already split up, you'll probably
have to write a custom filter that manages that logic.

Best
Erick


On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen  wrote:


Unfortunately, no...I am not combine every two term into one. I am
combining a specific pair.

E.g. the Token Stream: t1 t2 t2a t3
should be rewritten into t1 t2t2a t3

But the TS: t1 t2 t3 t2a
should not be rewritten, and it is already correct


On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <
alan.woodw...@romseysoftware.co.uk> wrote:

> Have a look at ShingleFilter:
>
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
>
> On 21 Dec 2012, at 08:42, Xi Shen wrote:
>
> > I have to use the white space and word delimiter to process the input
> > first. I tried many combination, and it seems to me that it is
inevitable
> > the term will be split into two :(
> >
> > I think developing my own filter is the only resolution...but I just
> cannot
> > find a guide to help me understand what I need to do to implement a
> > TokenFilter.
> >
> >
> > On Fri, Dec 21, 2012 at 4:03 PM, Danil ŢORIN 
wrote:
> >
> >> Easiest way would be to pre-process your input and join those 2 
> >> tokens

> >> before splitting them by white space.
> >>
> >> But from given context I might miss some details...still worth a 
> >> shot.

> >>
> >> On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen 
wrote:
> >>
> >>> Hi,
> >>>
> >>> I am looking for a token filter that can combine 2 terms into 1? 
> >>> E.g.

> >>>
> >>> the input has been tokenized by white space:
> >>>
> >>> t1 t2 t2a t3
> >>>
> >>> I want a filter that output:
> >>>
> >>> t1 t2t2a t3
> >>>
> >>> I know it is a very special case, and I am thinking about develop a
> >> filter
> >>> of my own. But I cannot figure out which API I should use to look 
> >>> for

> >> terms
> >>> in a Token Stream.
> >>>
> >>> --
> >>> Regards,
> >>> David Shen
> >>>
> >>> http://about.me/davidshen
> >>> https://twitter.com/#!/davidshen84
> >>>
> >>
> >
> >
> >
> > --
> > Regards,
> > David Shen
> >
> > http://about.me/davidshen
> > https://twitter.com/#!/davidshen84
>
>


--
Regards,
David Shen

http://about.me/davidshen
https://twitter.com/#!/davidshen84




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which token filter can combine 2 terms into 1?

2012-12-21 Thread Jack Krupansky
You still have the query parser's parsing before analysis to deal with, no 
matter what magic you code in your analyzer.


-- Jack Krupansky

-Original Message- 
From: Tom

Sent: Friday, December 21, 2012 2:24 PM
To: java-user@lucene.apache.org
Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky 
wrote:



And to be more specific, most query parsers will have already separated
the terms and will call the analyzer with only one term at a time, so no
term recombination is possible for those parsed terms, at query time.


Most analyzers will do that, yes. But if Xi writes his own analyzer with
his own combiner filter, then he should also use this for query generation
and thus get the desired combinations / snippets there as well.

Xi, here is the recipe:
- SnippetFilter extends TokenFilter
-SnippetFilter  needs access to your lexicon: a data structure to store
your snippets. In the general case this is a tree, and going along a branch
will tell you whenever a valid snipped has been built or if the snipped
could be longer. (Example: "internal revenue" can be one snippet but,
depending on the next token, a larger snipped of "internal revenue service"
could be built.)
- Logic of the SnippetFilter.incrementToken() goes something like this: You
need a loop which retrieves tokens from the input variable until the input
is empty. You store each retrieved token in a variable(s) x in
SnippetFilter . As long as you have a potential match against your lexicon,
you can continue in this loop. Once you realize that there is something
within x which can not possibly become a (longer) snippet, break out of the
loop and allow the consumer to retrieve it.
- make sure your analyzer inserts SnippetFilter at the correct spot in the
filter chain.

Cheers
FiveMileTom
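
For readers who want to try this recipe on the simpler fixed-pair case from this
thread, a minimal look-ahead sketch (the class name and pair handling are
illustrative only; offset/position bookkeeping for the merged token is omitted):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class PairMergingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final String first, second;  // the specific pair to merge, e.g. "t2" and "t2a"
  private State buffered;              // a token read ahead but not yet emitted

  public PairMergingFilter(TokenStream in, String first, String second) {
    super(in);
    this.first = first;
    this.second = second;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (buffered != null) {            // emit the token we looked ahead at last time
      restoreState(buffered);
      buffered = null;
      return true;
    }
    if (!input.incrementToken()) return false;
    if (!termAtt.toString().equals(first)) return true;  // not the pair start: pass through

    State firstState = captureState();
    if (!input.incrementToken()) {     // stream ended right after "first": emit it unchanged
      restoreState(firstState);
      return true;
    }
    if (termAtt.toString().equals(second)) {
      termAtt.setEmpty().append(first).append(second);   // merge the pair into one token
      return true;
    }
    buffered = captureState();         // no match: emit "first" now, keep look-ahead for later
    restoreState(firstState);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    buffered = null;
  }
}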







-- Jack Krupansky
-Original Message- From: Erick Erickson
Sent: Friday, December 21, 2012 8:27 AM
To: java-user
Subject: Re: Which token filter can combine 2 terms into 1?


If it's a fixed list and not excessively long, would synonyms work?

But if theres some kind of logic you need to apply, I don't think you're
going to find anything OOB.
The problem is that by the time a token filter gets called, they are
already split up, you'll probably
have to write a custom filter that manages that logic.

Best
Erick


On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen  wrote:

 Unfortunately, no...I am not combine every two term into one. I am

combining a specific pair.

E.g. the Token Stream: t1 t2 t2a t3
should be rewritten into t1 t2t2a t3

But the TS: t1 t2 t3 t2a
should not be rewritten, and it is already correct


On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <
alan.woodward@romseysoftware.co.uk> wrote:

> Have a look at ShingleFilter:
>
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
>
> On 21 Dec 2012, at 08:42, Xi Shen wrote:
>
> > I have to use the white space and word delimiter to process the input
> > first. I tried many combination, and it seems to me that it is
inevitable
> > the term will be split into two :(
> >
> > I think developing my own filter is the only resolution...but I just
> cannot
> > find a guide to help me understand what I need to do to implement a
> > TokenFilter.
> >
> >
> > On Fri, Dec 21, 2012 at 4:03 PM, Danil ŢORIN 
wrote:
> >
> >> Easiest way would be to pre-process your input and join those 2 > >>
tokens
> >> before splitting them by white space.
> >>
> >> But from given context I might miss some details...still worth a >
>> shot.
> >>
> >> On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen 
wrote:
> >>
> >>> Hi,
> >>>
> >>> I am looking for a token filter that can combine 2 terms into 1? >
>>> E.g.
> >>>
> >>> the input has been tokenized by white space:
> >>>
> >>> t1 t2 t2a t3
> >>>
> >>> I want a filter that output:
> >>>
> >>> t1 t2t2a t3
> >>>
> >>> I know it is a very special case, and I am thinking about develop a
> >> filter
> >>> of my own. But I cannot figure out which API I should use to look >
>>> for
> >> terms
> >>> in a Token Stream.
> >>>
> >>> --
> >>> Regards,
> >>> David Shen
> >>>
> >>> http://about.me/davidshen
> >>> https://twitter.com/#!/davidshen84
> >>>
> >>
> >
> >
> >
> > --
> >

Re: Retrieving granular scores back from Lucene/SOLR

2012-12-25 Thread Jack Krupansky
The Explanation tree returned by IndexSearcher#explain is as good as you are 
going to get, but is rather expensive. You are asking for a lot, so you 
should be prepared to pay for it.


See:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, 
int)


-- Jack Krupansky

-Original Message- 
From: Vishwas Goel

Sent: Tuesday, December 25, 2012 11:30 PM
To: java-user@lucene.apache.org
Subject: Retrieving granular scores back from Lucene/SOLR

Hi,

I am looking to get a bit more information back from SOLR/Lucene about the
query/document pair scores. This would include field level scores, overall
text relevance score, Boost value, BF value etc.

Information could either be encoded in the score itself that Lucene/Solr
returns - ideally however i would want to return an array of scores back
from SOLR/Lucene. I understand that i could work with the debug output -
but i am looking to do this for the production use case.

Has anyone explored something like this? Does someone know, where is the
final score(not just the relevance score) computed?

Thanks,
Vishwas 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: TokenFilter state question

2012-12-26 Thread Jack Krupansky
You need a "reset" method that calls the super reset to reset the parent 
state and then reset your own state.


http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html#reset()

You probably don't have one, so only the parent state gets reset.
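
For the filter below, that would look something like this (a sketch):

@Override
public void reset() throws IOException {
    super.reset();           // resets the wrapped input stream
    previousWord = null;     // clear our own left-over state
    words.clear();
}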

-- Jack Krupansky

-Original Message- 
From: Jeremy Long

Sent: Wednesday, December 26, 2012 9:08 AM
To: java-user@lucene.apache.org
Subject: TokenFilter state question

Hello,

I'm still trying to figure out some of the nuances of Lucene and I have run
into a small issue. I have created my own custom analyzer which uses the
WhitespaceTokenizer and chains together the LowercaseFilter,
StopwordFilter, and my own custom filter (below). I am using this analyzer
when searching (i.e. it is the analyzer used in a QueryParser).

The custom analyzer's purpose is to add tokens by concatenating the previous
word with the current word. So that if you were given "Spring Framework
Core" the resulting tokens would be "Spring SpringFramework Framework
FrameworkCore Core". My problem is that when my query text is "Spring
Framework Core" I end up with left-over state in
my TokenPairConcatenatingFilter (the previousWord is a member field). So if
I end up re-using my query parser on a subsequent search for "Apache
Struts" I end up with the token stream of "CoreApache Apache ApacheStruts
Struts". The Initial "core" was left over state.

The left over state from the initial query appears to arise because in my
initial loop that collects all of the tokens from the underlying stream
only collects a single token. So the processing is - we collect the token
"spring", we write "spring" out to the stream and move it to the
previousWord. Next, we are at the end of the stream and we have no more
words in the list so the filter returns false. At this time, the filter is
called again and "Framework" is collected... repeat until end of tokens
from the query is reached; however, "Core" is left in the previousWord
field. The filter would work correctly with no state being left over if all
of the tokens were collected at the beginning (i.e. the first call to
incrementToken). Can anyone explain why all of the tokens would not be
collected and/or a work around so that when
QueryParser.parse("field:(Spring Framework Core)") is called residual state
is not left over in my token filter? I have two hack solutions - 1) don't
reuse the analyzer/QueryParser for subsequent queries or 2) build in a
reset mechanism to clear the previousWord field. I don't like either
solution and was hoping someone from the list might have a suggestion as to
what I've done wrong or some feature of Lucene I've missed. The code is
below.

Thanks in advance,

Jeremy



//
// TokenPairConcatenatingFilter

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import
org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
* Takes a TokenStream and adds additional tokens by concatenating pairs
of words.
* Example: "Spring Framework Core" -> "Spring SpringFramework
Framework FrameworkCore Core".
*
* @author Jeremy Long (jeremy.l...@gmail.com)
*/
public final class TokenPairConcatenatingFilter extends TokenFilter {

   private final CharTermAttribute termAtt =
addAttribute(CharTermAttribute.class);
   private final PositionIncrementAttribute posIncAtt =
addAttribute(PositionIncrementAttribute.class);
   private String previousWord = null;
   private LinkedList<String> words = null;

   public TokenPairConcatenatingFilter(TokenStream stream) {
   super(stream);
   words = new LinkedList<String>();
   }

   /**
* Increments the underlying TokenStream and sets CharTermAtttributes to
* construct an expanded set of tokens by concatenting tokens with the
* previous token.
*
* @return whether or not we have hit the end of the TokenStream
* @throws IOException is thrown when an IOException occurs
*/
   @Override
   public boolean incrementToken() throws IOException {

   //collect all the terms into the words collection
   while (input.incrementToken()) {
   String word = new String(termAtt.buffer(), 0, termAtt.length());
   words.add(word);
   }

   //if we have a previousTerm - write it out as its own token concatenated
   // with the current word (if one is available).
   if (previousWord != null && words.size() > 0) {
   String word = words.getFirst();
   clearAttributes();
   termAtt.append(previousWord).append(word);
   posIncAtt.setPositionIncrement(0);
   previousWord = null;
  

Re: Which token filter can combine 2 terms into 1?

2012-12-26 Thread Jack Krupansky
Ah! You're quoting full phrases. You weren't clear about that originally. 
Thanks for the clarification.


-- Jack Krupansky

-Original Message- 
From: Tom

Sent: Wednesday, December 26, 2012 5:54 PM
To: java-user@lucene.apache.org
Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 2:44 PM, Jack Krupansky 
wrote:



You still have the query parser's parsing before analysis to deal with, no
matter what magic you code in your analyzer.



Not quite.
"query parser's parsing" comes first, you are correct on that. But it is
irrelevant for splitting field values into search terms, because this part
of the whole process is done by an analyzer. Therefore, if you make sure
the correct analyzer is used, then the parsing and splitting into
individual search terms will be done by this analyzer, not by the query
parser.

Try it: Implement an analyzer with the SnippetFilter below. Start Luke and
make sure this analyzer is selected in "Analyzer to use for query parsing".
In the search expression, type in any length of text for example:

body:"word1 word2 word3"

and you will get the possibly combined Terms.

For example, let's say one snippet in your SnippetFilter is: "word2 word3"
you will get

Term 0: field=body text=word1
Term 1: field=body text=word2 word3

In this case, word2 and word3 will NOT be split.





-- Jack Krupansky

-Original Message- From: Tom
Sent: Friday, December 21, 2012 2:24 PM
To: java-user@lucene.apache.org

Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky *
*wrote:

 And to be more specific, most query parsers will have already separated

the terms and will call the analyzer with only one term at a time, so no
term recombination is possible for those parsed terms, at query time.

 Most analyzers will do that, yes. But if Xi writes his own analyzer with

his own combiner filter, then he should also use this for query generation
and thus get the desired combinations / snippets there as well.

Xi, here is the recipe:
- SnippetFilter extends TokenFilter
- SnippetFilter needs access to your lexicon: a data structure to store
your snippets. In the general case this is a tree, and going along a branch
will tell you whenever a valid snippet has been built or if the snippet
could be longer. (Example: "internal revenue" can be one snippet but,
depending on the next token, a larger snippet of "internal revenue service"
could be built.)
- Logic of the SnippetFilter.incrementToken() goes something like this: You
need a loop which retrieves tokens from the input variable until the input
is empty. You store each retrieved token in a variable(s) x in
SnippetFilter. As long as you have a potential match against your lexicon,
you can continue in this loop. Once you realize that there is something
within x which can not possibly become a (longer) snippet, break out of the
loop and allow the consumer to retrieve it.
- make sure your analyzer inserts SnippetFilter at the correct spot in the
filter chain.

Cheers
FiveMileTom







-- Jack Krupansky
-Original Message- From: Erick Erickson
Sent: Friday, December 21, 2012 8:27 AM
To: java-user
Subject: Re: Which token filter can combine 2 terms into 1?


If it's a fixed list and not excessively long, would synonyms work?

But if theres some kind of logic you need to apply, I don't think you're
going to find anything OOB.
The problem is that by the time a token filter gets called, they are
already split up, you'll probably
have to write a custom filter that manages that logic.

Best
Erick


On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen  wrote:

 Unfortunately, no...I am not combine every two term into one. I am


combining a specific pair.

E.g. the Token Stream: t1 t2 t2a t3
should be rewritten into t1 t2t2a t3

But the TS: t1 t2 t3 t2a
should not be rewritten, and it is already correct


On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <
alan.woodward@romseysoftware.co.uk >>

wrote:

> Have a look at ShingleFilter:
>
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
>

>
> On 21 Dec 2012, at 08:42, Xi Shen wrote:
>
> > I have to use the white space and word delimiter to process the 
> > input

> > first. I tried many combination, and it seems to me that it is
inevitable
> > the term will be split into two :(
> >
> > I think developing my own filter is the only resolution...but I just
> cannot
> > find a guide to help me unders

Re: TokenFilter state question

2012-12-26 Thread Jack Krupansky


If you want the query parser to present a sequence of tokens to your 
analyzer as one unit, you need to enclose the sequence in quotes. Otherwise, 
the query parser will pass each of the terms to the analyzer as a single 
unit, with a reset for each, as you in fact report.


So, change:

   String querystr = "product:(Spring Framework Core) 
vendor:(SpringSource)";


to

   String querystr = "product:\"Spring Framework Core\" 
vendor:(SpringSource)";


-- Jack Krupansky

-Original Message- 
From: Jeremy Long

Sent: Wednesday, December 26, 2012 5:52 PM
To: java-user@lucene.apache.org
Subject: Re: TokenFilter state question

Actually, I had thought the same thing and had played around with the reset
method. However, in my example where I called the custom analyzer the
"reset" method was called after every token, not after the end of the
stream (which implies each token was treated as its own TokenStream?).

   String querystr = "product:(Spring Framework Core)
vendor:(SpringSource)";

    Map<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();

   fieldAnalyzers.put("product", new
SearchFieldAnalyzer(Version.LUCENE_40));
   fieldAnalyzers.put("vendor", new
SearchFieldAnalyzer(Version.LUCENE_40));

   PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
   new StandardAnalyzer(Version.LUCENE_40), fieldAnalyzers);
   QueryParser parser = new QueryParser(Version.LUCENE_40, field1,
wrapper);

   Query q = parser.parse(querystr);

In the above example (from the code snippets in the original email), if I
were to add a reset method it would be called 4 times (yes, I have tried
this). Reset gets called for each token "Spring" "Framework" "Core" and
"SpringSource". Thus, if I reset my internal state I would not achieve the
goal of have "Spring Framework Core" result in the tokens "Spring
SpringFramework Framework FrameworkCore Core". My question is - why would
these be treated as separate token streams?

--Jeremy

On Wed, Dec 26, 2012 at 10:54 AM, Jack Krupansky 
wrote:



You need a "reset" method that calls the super reset to reset the parent
state and then reset your own state.

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html#reset()

You probably don't have one, so only the parent state gets reset.

-- Jack Krupansky

-Original Message- From: Jeremy Long
Sent: Wednesday, December 26, 2012 9:08 AM
To: java-user@lucene.apache.org
Subject: TokenFilter state question


Hello,

I'm still trying to figure out some of the nuances of Lucene and I have 
run

into a small issue. I have created my own custom analyzer which uses the
WhitespaceTokenizer and chains together the LowercaseFilter,
StopwordFilter, and my own custom filter (below). I am using this analyzer
when searching (i.e. it is the analyzer used in a QueryParser).

The custom analyzers purpose is to add tokens by concatenating the 
previous

word with the current word. So that if you were given "Spring Framework
Core" the resulting tokens would be "Spring SpringFramework Framework
FrameworkCore Core". My problem is that when my query text is "Spring
Framework Core" I end up with left-over state in
my TokenPairConcatenatingFilter (the previousWord is a member field). So 
if

I end up re-using my query parser on a subsequent search for "Apache
Struts" I end up with the token stream of "CoreApache Apache ApacheStruts
Struts". The Initial "core" was left over state.

The left over state from the initial query appears to arise because in my
initial loop that collects all of the tokens from the underlying stream
only collects a single token. So the processing is - we collect the token
"spring", we write "spring" out to the stream and move it to the
previousWord. Next, we are at the end of the stream and we have no more
words in the list so the filter returns false. At this time, the filter is
called again and "Framework" is collected... repeat until end of tokens
from the query is reached; however, "Core" is left in the previousWord
field. The filter would work correctly with no state being left over if 
all

of the tokens were collected at the beginning (i.e. the first call to
incrementToken). Can anyone explain why all of the tokens would not be
collected and/or a work around so that when
QueryParser.parse("field:(Spring Framework Core)") is called residual
state
is not left over in my token filter? I have two hack solutions - 1) don't
reuse the analyzer/QueryParser for subsequent queries or 2) build in a
reset mechanism to clear the previousWord field. I don't like either
solution and was hoping 

Re: Differences in MLT Query Terms Question

2013-01-08 Thread Jack Krupansky
The term "arv" is on the first list, but not the second. Maybe it's document 
frequency fell below the setting for minimum document frequency on the 
second run.


Or, maybe the minimum word length was set to 4 or more on the second run.

Are you using MoreLikeThisQuery or directly using MoreLikeThis?

Or, possibly "arv" appears later in a document on the second run, after the 
number of tokens specified by maxNumTokensParsed.
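
If you want the two runs to be directly comparable, it may help to pin those knobs explicitly instead of relying on the defaults. A rough sketch, assuming you drive MoreLikeThis directly against an open reader (the field name is just an example):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "contents" });   // example field name
mlt.setMaxQueryTerms(100);
mlt.setMinDocFreq(1);        // default is 5; with only 4 docs this can silently drop terms
mlt.setMinTermFreq(1);       // default is 2
mlt.setMinWordLen(0);        // don't filter out short terms like "arv"
mlt.setMaxNumTokensParsed(Integer.MAX_VALUE);  // don't truncate long documents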


-- Jack Krupansky

-Original Message- 
From: Peter Lavin

Sent: Tuesday, January 08, 2013 1:46 PM
To: java-user@lucene.apache.org
Subject: Differences in MLT Query Terms Question


Dear Users,

I am running some simple experiments with Lucene and am seeing something
I don't understand.

I have 16 text files on 4 different topics, ranging in size from 50-900
KB.  When I index all 16 of these and run an MLT query based on one of
the indexed documents, I get an expected result (i.e. similar topics are
found).

When I reduce the number of text files to 4 and index them (having taken
care to overwriting the previous index files), and then run the same MLT
query (based on the same document from the index), I get slightly
different scores. I'm assuming this is because the IDF is now different
because there is less documents.

For each run, I have set the max number of terms as...
mlt.setMaxQueryTerms(100)

However, when I compare the terms which get used for the MLT query on
the 16 document index and the 4 document index, they are slightly
different. I've printed, parsed and sorted them into two columns of a
CSV file. I've pasted a small part of it at the end of this email.

My Question(s)...
1) Can anybody explain why the set of terms used for the MLT query is
different when a file from an index of 16 documents versus 4 documents
is used?

2) Am I right in assuming that the reason for slightly different scores
in the IDF, or could it be this slight difference in the sets of terms
used (or possibly both)?

regards,
Peter


--
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536



"about","about"
"affordable","affordable"
"agents","agents"
"aids","aids"
"architecture","architecture"
"arv","based"
"based","blog"
"blog","board"
"board","business"
"business","care"
"care","commemorates"
"commemorates","contacts"
"contacts","contributions"
"contributions","coordinating"
"coordinating","core"
"core","countries"
"countries","country"
"country","data"
"data","decisions"
"decisions","details"
"details","disbursements"
"disbursements","documents"
"documents","donors"

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: FuzzyQuery in lucene 4.0

2013-01-09 Thread Jack Krupansky
FWIW, new FuzzyQuery(term, 2 ,0) is the same as new FuzzyQuery(term), given 
the current values of defaultMaxEdits (2) and defaultPrefixLength (0).


-- Jack Krupansky

-Original Message- 
From: Ian Lea

Sent: Wednesday, January 09, 2013 9:44 AM
To: java-user@lucene.apache.org
Subject: Re: FuzzyQuery in lucene 4.0

See the javadocs for FuzzyQuery to see what the parameters are.  I
can't tell you what the comment means.  Possible values to try maybe?


--
Ian.


On Wed, Jan 9, 2013 at 2:34 PM, algebra  wrote:

is true Ian, o code is good.

The only thing that I dont understand is a line:

Query query = new FuzzyQuery(term, 2 ,0); //0-2

Whats means 0 to 2?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/FuzzyQuery-in-lucene-4-0-tp4031871p4031879.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene-MoreLikethis

2013-01-15 Thread Jack Krupansky
There are lots of parameters you can adjust, but the defaults essentially 
assume that you have a fairly large corpus and aren't interested in 
low-frequency terms.


So, try MoreLikeThis#setMinDocFreq. The default is 5. You don't have any 
terms in your example with a doc freq over 2.


Also, try setMinTermFreq. The default is 2. You don't have any terms with a 
term frequency above 1.
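
Concretely, something like this in your example should start producing hits (a sketch; since the "title" field has no term vectors, you will likely also need to hand MoreLikeThis an analyzer):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "title" });
mlt.setMinDocFreq(1);        // default of 5 is far too high for a 4-document index
mlt.setMinTermFreq(1);       // default of 2; every title term occurs only once
mlt.setAnalyzer(analyzer);   // reuse the StandardAnalyzer from the indexing side
Query query = mlt.like(1);
System.out.println(searcher.search(query, 5).totalHits);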


-- Jack Krupansky

-Original Message- 
From: Thomas Keller

Sent: Tuesday, January 15, 2013 3:22 PM
To: java-user@lucene.apache.org
Subject: Lucene-MoreLikethis

Hey,

I have a question about "MoreLikeThis" in Lucene, Java. I built up an index 
and want to find similar documents. But I always get no results for my 
query, mlt.like(1) is always empty. Can anyone find my mistake? Here is an 
example. (I use Lucene 4.0)


public class HelloLucene {
 public static void main(String[] args) throws IOException, ParseException 
{


  StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
  Directory index = new RAMDirectory();
  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, 
analyzer);


   IndexWriter w = new IndexWriter(index, config);
   addDoc(w, "Lucene in Action", "193398817");
   addDoc(w, "Lucene for Dummies", "55320055Z");
   addDoc(w, "Managing Gigabytes", "55063554A");
   addDoc(w, "The Art of Computer Science", "9900333X");
   w.close();

   // search
   IndexReader reader = DirectoryReader.open(index);
   IndexSearcher searcher = new IndexSearcher(reader);

   MoreLikeThis mlt = new MoreLikeThis(reader);
   Query query = mlt.like(1);
   System.out.println(searcher.search(query, 5).totalHits);
 }

 private static void addDoc(IndexWriter w, String title, String isbn) 
throws IOException {

   Document doc = new Document();
   doc.add(new TextField("title", title, Field.Store.YES));

   // use a string field for isbn because we don't want it tokenized
   doc.add(new StringField("isbn", isbn, Field.Store.YES));
   w.addDocument(doc);
 }
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Jack Krupansky
You need to express the "boolean" query solely in terms of SpanOrQuery and 
SpanNearQuery. If you can't, ... then it probably can't be done, but you 
should be able to.


How about starting with a plan English description of the problem you are 
trying to solve?


-- Jack Krupansky

-Original Message- 
From: Michel Conrad

Sent: Thursday, January 17, 2013 11:01 AM
To: java-user@lucene.apache.org
Subject: Combine two BooleanQueries by a SpanNearQuery.

Hi,

I am looking to get a combination of multiple subqueries.

What I want to do is to have two queries which have to be near one to 
another.


As an example:
Query1: (A AND (B OR C))
Query2: D

Then I want to use something like a SpanNearQuery to combine both (slop 5):
Both would then have to match and D should be within slop 5 to A, B or C.

So my question is if there is a query that combines two BooleanQuery
trees into a SpanNearQuery.
It would have to take the terms that match Query 1 and the terms that
match Query 2, and look if there is a combination within the required
slop.

Can I rewrite the BooleanQuery after parsing the query as a
MultiTermQuery, than wrap these in SpanMultiTermQueryWrapper, which
can be combined by the SpanNearQuery?

Best regards,
Michel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Jack Krupansky
Currently there isn't. SpanNearQuery can take only other SpanQuery objects, 
which includes other spans, span terms, and span wrapped multi-term queries 
(e.g., wildcard, fuzzy query), but not Boolean queries.


But it does sound like a good feature request.

There is SpanNotQuery, so you can exclude terms from a span.
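
For the concrete case from your mail ('"apple monkey"~10 AND NOT computer'), a hand-built equivalent could look roughly like this sketch (field name assumed to be "body"):

SpanQuery apple  = new SpanTermQuery(new Term("body", "apple"));
SpanQuery monkey = new SpanTermQuery(new Term("body", "monkey"));
// "apple" within 10 positions of "monkey", in either order
SpanQuery near = new SpanNearQuery(new SpanQuery[] { apple, monkey }, 10, false);

BooleanQuery q = new BooleanQuery();
q.add(near, BooleanClause.Occur.MUST);
q.add(new TermQuery(new Term("body", "computer")), BooleanClause.Occur.MUST_NOT);

The general case - an arbitrary parsed BooleanQuery on each side - still has to be rewritten by hand into SpanOr/SpanNear/SpanNot terms, as noted above.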

-- Jack Krupansky

-Original Message- 
From: Michel Conrad

Sent: Thursday, January 17, 2013 12:14 PM
To: java-user@lucene.apache.org
Subject: Re: Combine two BooleanQueries by a SpanNearQuery.

The problem I would like to solve is to have two queries that I will
get from the query parser (this could include wildcardqueries and
phrasequeries).
Both of these queries would have to match the document and as an
additional restriction I would like to add that a matching term from
the first query
is near a matching term from the second query.

So that you can search for instance for matching documents with
in your first query 'apple AND NOT computer'
and in your second 'monkey'
with a slop of 10 between the two queries, then it would be equivalent
to '"apple monkey"~10 AND NOT computer'.

I was wondering if there is a method to combine more complicated
queries in a similar way. (Some kind of generic solution)

Thanks for your help,

Michel

On Thu, Jan 17, 2013 at 5:14 PM, Jack Krupansky  
wrote:

You need to express the "boolean" query solely in terms of SpanOrQuery and
SpanNearQuery. If you can't, ... then it probably can't be done, but you
should be able to.

How about starting with a plan English description of the problem you are
trying to solve?

-- Jack Krupansky

-Original Message- From: Michel Conrad
Sent: Thursday, January 17, 2013 11:01 AM
To: java-user@lucene.apache.org
Subject: Combine two BooleanQueries by a SpanNearQuery.


Hi,

I am looking to get a combination of multiple subqueries.

What I want to do is to have two queries which have to be near one to
another.

As an example:
Query1: (A AND (B OR C))
Query2: D

Then I want to use something like a SpanNearQuery to combine both (slop 
5):

Both would then have to match and D should be within slop 5 to A, B or C.

So my question is if there is a query that combines two BooleanQuery
trees into a SpanNearQuery.
It would have to take the terms that match Query 1 and the terms that
match Query 2, and look if there is a combination within the required
slop.

Can I rewrite the BooleanQuery after parsing the query as a
MultiTermQuery, than wrap these in SpanMultiTermQueryWrapper, which
can be combined by the SpanNearQuery?

Best regards,
Michel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SpanNearQuery with two boundaries

2013-01-18 Thread Jack Krupansky

+1

I think that accurately states the semantics of the operation you want.
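
In code, that suggestion would look roughly like this (slops and field name taken from the example; unordered matching assumed):

SpanQuery runs = new SpanTermQuery(new Term("word", "runs"));
SpanQuery cat  = new SpanTermQuery(new Term("word", "cat"));

SpanQuery within10 = new SpanNearQuery(new SpanQuery[] { runs, cat }, 10, false);
SpanQuery within3  = new SpanNearQuery(new SpanQuery[] { runs, cat }, 3, false);

// keep the wide matches, drop the spans that also match the narrow query
SpanQuery wideButNotNarrow = new SpanNotQuery(within10, within3);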

-- Jack Krupansky

-Original Message- 
From: Alan Woodward

Sent: Friday, January 18, 2013 1:08 PM
To: java-user@lucene.apache.org
Subject: Re: SpanNearQuery with two boundaries

Hi Igor,

You could try wrapping the two cases in a SpanNotQuery:
SpanNot(SpanNear(runs, cat, 10), SpanNear(runs, cat, 3))

That should return documents that have runs within 10 positions of cat, as 
long as they don't overlap with runs within 3 positions of cat.


Alan Woodward
www.flax.co.uk


On 18 Jan 2013, at 16:13, Igor Shalyminov wrote:


Hello!

I want to perform search queries like this one:
word:"dog" \1 word:"runs" (\3 \10) word:"cat"

It is thus something like SpanNearQuery, but with two boundaries - minimum 
and maximum distance between the terms (which in the \1-case would be 
equal).
Syntax (as above, fictional:) itself doesn't matter, I just want to know 
if one is able to build this type of query based on existing (Lucene 
4.0.0) query classes.


--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing multiple fields with one document position

2013-01-21 Thread Jack Krupansky
Send the same input text to two different analyzers for two separate fields. 
The first analyzer emits only the first attribute. The second analyzer emits 
only the second attribute. The document position in one will correspond to 
the document position in the other.
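
A rough sketch of the indexing side, assuming you write the two attribute-stripping analyzers yourself (PosOnlyAnalyzer and NumberOnlyAnalyzer are made-up names) and that each emits exactly one token per input entry so the positions line up:

Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("pos",    new PosOnlyAnalyzer());     // hypothetical: keeps only the part-of-speech attribute
perField.put("number", new NumberOnlyAnalyzer());  // hypothetical: keeps only the number attribute
Analyzer analyzer = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_40), perField);

IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_40, analyzer));

Document doc = new Document();
String annotated = "cat,<NOUN,singular>";   // hypothetical annotated input format
doc.add(new TextField("pos", annotated, Field.Store.NO));
doc.add(new TextField("number", annotated, Field.Store.NO));
writer.addDocument(doc);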


-- Jack Krupansky

-Original Message- 
From: Igor Shalyminov

Sent: Monday, January 21, 2013 3:04 AM
To: java-user@lucene.apache.org
Subject: Indexing multiple fields with one document position

Hello!

When indexing text with position data, one just adds field do a document in 
the form of its name and value, and the indexer assigns it unique position 
in the index.

I wonder, if I have an entry with two attributes, say:

cat,

How do I store in the index two fields, "pos" and "number" with its values, 
pointing to the same position in the document?


--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
You may be able to use Tika directly without needing to choose the specific 
classes, although the latter may give you the specific data you need without 
the extra overhead.


You could take a look at the Solr Extracting Request Handler source for an 
example:

http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/

Basically, Tika extracts a bunch of "metadata" and then you will have to add 
selected metadata to your Lucene documents. "content" is the main document 
body text.
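
If you go the direct-Tika route, a bare-bones sketch looks something like this (AutoDetectParser picks the right parser for PDF/Word/Excel behind the scenes; the Lucene field names are just examples, and "writer" is an open IndexWriter):

InputStream in = new FileInputStream(file);
BodyContentHandler handler = new BodyContentHandler(-1);   // -1 = no write limit
Metadata metadata = new Metadata();
new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
in.close();

Document doc = new Document();
doc.add(new TextField("content", handler.toString(), Field.Store.NO));  // main body text
String title = metadata.get("title");   // pick whichever metadata keys you care about
if (title != null) {
    doc.add(new TextField("title", title, Field.Store.YES));
}
writer.addDocument(doc);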


You could try Solr itself to see how it works:
http://wiki.apache.org/solr/ExtractingRequestHandler

-- Jack Krupansky

-Original Message- 
From: Adrien Grand

Sent: Sunday, January 27, 2013 12:53 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for 
indexing the actual content


Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?

This question seems to be more appropriate for the Tika user mailing list 
[3]?


[1] http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[2] http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
Re-read my last message - and then take a look at that Solr source code, 
which will give you an idea how to use Tika, even though you are using 
Lucene only. If you have specific questions, please be specific.


To answer your latest question, yes, Tika is good enough. Solr 
/update/extract uses it, and Solr is based on Lucene.


-- Jack Krupansky

-Original Message- 
From: saisantoshi

Sent: Sunday, January 27, 2013 2:09 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for 
indexing the actual content


We are not using Solr and using just Lucene core 4.0 engine. I am trying to
see if we can use tika library to extract textual information from
pdf/word/excel documents. I am mainly interested in reading the contents
inside the documents and index using lucene. My question here is , is tika
framework good enough or is there any other better library. Any
issues/experiences in using the tika framework.

Thanks,
Sai.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Readers-for-extracting-textual-info-from-pd-doc-excel-for-indexing-the-actual-content-tp4036379p4036557.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-28 Thread Jack Krupansky
Let's see your code that calls FuzzyQuery . If you happen to pass a 
prefixLength (3rd parameter) of 3 or more, then "ster" would not match 
"star" (but prefixLength of 2 would match).


-- Jack Krupansky
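
For reference, the effect of that third parameter in code (a quick sketch against a hypothetical "title" field):

Term term = new Term("title", "ster");
Query q0 = new FuzzyQuery(term);          // maxEdits=2, prefixLength=0: "star" is a candidate
Query q2 = new FuzzyQuery(term, 2, 2);    // candidates must share the prefix "st": still matches "star"
Query q3 = new FuzzyQuery(term, 2, 3);    // candidates must share "ste": "star" is no longer considered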

-Original Message- 
From: George Kelvin

Sent: Monday, January 28, 2013 5:31 PM
To: java-user@lucene.apache.org
Subject: Questions about FuzzyQuery in Lucene 4.x

Hi All,

I’m working on several projects requiring powerful search features. I’ve
been waiting for Lucene 4 since I read Michael McCandless's blog post about
Lucene’s new FuzzyQuery and finally I got the chance to test it. The
improvement on the fuzzy search is really impressive!

However, I’ve encountered a problem today:

In my data, there are some records with keywords “star” and “wars”. But
when I issued a fuzzy query with two keywords “ster” and “wats”, the engine
failed to find the records.

I’m wondering if you can provide any inputs on that. Maybe I’m not doing
fuzzy search in the right way. But all my other fuzzy queries with single
keyword and with longer double keywords worked perfectly.

Another issue is that I’m also exploring the possibility to do
wildcard+fuzzy search using Lucene. I couldn’t find any related document
for this on Lucene's website, but I found a stackoverfow thread talking
about this.

http://stackoverflow.com/questions/2631206/lucene-query-bla-match-words-that-start-with-something-fuzzy-how

I tried the way suggested by the second answer there, and it worked.
However the scoring is strange: all results were assigned with the exactly
same score. Is there anything I can do to get the scoring right?

Can you tell me what’s the best way to do wildcard fuzzy search?

Any help will be appreciated!

Thanks,

George 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-29 Thread Jack Krupansky

That depends on the value of "ed", and the indexed data.

Another factor to take into consideration is that a case change ("Star" vs. 
"star") also counts as an edit.


-- Jack Krupansky

-Original Message- 
From: George Kelvin

Sent: Tuesday, January 29, 2013 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

Thanks for your reply!

I don't think I passed the prefixLength parameter in.

Here is the code I used to build the FuzzyQuery:

   String[] words = str.split("\\+");
   BooleanQuery query = new BooleanQuery();

   for (int i=0; i

George



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-29 Thread Jack Krupansky
I also noticed that you have "MUST" for your full string of fuzzy terms - 
that means everyone of them must appear in an indexed document to be 
matched. Is it possible that maybe even one term was not in the same indexed 
document?


Try to provide a complete example that shows the input data and the query - 
all the literals. In other words, construct a minimal test case that shows 
the failure.


-- Jack Krupansky

-Original Message- 
From: George Kelvin

Sent: Tuesday, January 29, 2013 12:28 PM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

ed is set to 1 here and I have lowercased all the data and queries.

Regarding the indexed data factor you mentioned, can you elaborate more?

Thanks!

George


On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky 
wrote:



That depends on the value of "ed", and the indexed data.

Another factor to take into consideration is that a case change ("Star"
vs. "star") also counts as an edit.

-- Jack Krupansky

-Original Message- From: George Kelvin
Sent: Tuesday, January 29, 2013 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x


Hi Jack,

Thanks for your reply!

I don't think I passed the prefixLength parameter in.

Here is the code I used to build the FuzzyQuery:

   String[] words = str.split("\\+");
   BooleanQuery query = new BooleanQuery();

   for (int i=0; iTo unsubscribe, e-mail: 
java-user-unsubscribe@lucene.**apache.org
For additional commands, e-mail: 
java-user-help@lucene.apache.**org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-29 Thread Jack Krupansky
I'm sorry, but for anybody to help you here, you really need to be able to 
provide a concise test case, like 10-20 lines of code, completely 
self-contained. If you think you need a million documents to repro what you 
claimed was a simple scenario, then you leave me very, very confused - and 
unable to help you any further.


-- Jack Krupansky

-Original Message- 
From: George Kelvin

Sent: Tuesday, January 29, 2013 2:43 PM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

The problematic query is "scar"+"wads".

There are several (more than 10) documents in the data with the content
"star wars", so I think that query should be able to find all these
documents.

I was trying to provide a minimal test case, but I couldn't reduce the size
of data showing the failure.

The size of the minimal data showing the failure I got so far is around 2
million.

However, I found a suspicious document with content "scor". If I remove it
from the 2 million documents data, that query can find all the "star wars"
documents. If I add it back, then the query can't find any.

I tried to reduce the size of the data to 1 million further and add that
"scor" document, but now the query can still find all the "star wars"
documents.

Is it possible that Lucene somehow fail to find all the valid terms within
the edit distance?

Thanks!

George


On Tue, Jan 29, 2013 at 10:02 AM, Jack Krupansky 
wrote:



I also noticed that you have "MUST" for your full string of fuzzy terms -
that means everyone of them must appear in an indexed document to be
matched. Is it possible that maybe even one term was not in the same
indexed document?

Try to provide a complete example that shows the input data and the query
- all the literals. In other words, construct a minimal test case that
shows the failure.


-- Jack Krupansky

-Original Message- From: George Kelvin
Sent: Tuesday, January 29, 2013 12:28 PM

To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

ed is set to 1 here and I have lowercased all the data and queries.

Regarding the indexed data factor you mentioned, can you elaborate more?

Thanks!

George


On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky *
*wrote:

 That depends on the value of "ed", and the indexed data.


Another factor to take into consideration is that a case change ("Star"
vs. "star") also counts as an edit.

-- Jack Krupansky

-Original Message- From: George Kelvin
Sent: Tuesday, January 29, 2013 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x


Hi Jack,

Thanks for your reply!

I don't think I passed the prefixLength parameter in.

Here is the code I used to build the FuzzyQuery:

   String[] words = str.split("\\+");
   BooleanQuery query = new BooleanQuery();

   for (int i=0; i
>
For additional commands, e-mail: java-user-help@lucene.apache.org<
java-user-help@lucene.**apache.org >





--**--**-
To unsubscribe, e-mail: 
java-user-unsubscribe@lucene.**apache.org
For additional commands, e-mail: 
java-user-help@lucene.apache.**org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to find related words ?

2013-01-30 Thread Jack Krupansky

Take a look at MoreLikeThisQuery:
http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThisQuery.html

And MoreLikeThis itself:
http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html

So, the idea is search for documents using your keyword(s) and ask Lucene to 
extract relevant terms from the top document(s).
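
A sketch of that flow, assuming an open reader/searcher and a tokenized "body" field (set an analyzer if the field was indexed without term vectors):

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "body" });
mlt.setAnalyzer(analyzer);   // needed when "body" has no term vectors

// 1) find the top documents for the seed keyword
TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 5);

// 2) ask MLT which terms in those documents carry the most weight
for (ScoreDoc hit : top.scoreDocs) {
    String[] related = mlt.retrieveInterestingTerms(hit.doc);
    System.out.println(Arrays.toString(related));
}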


-- Jack Krupansky

-Original Message- 
From: wgggfiy

Sent: Wednesday, January 30, 2013 12:27 PM
To: java-user@lucene.apache.org
Subject: How to find related words ?

In short,
you put in a term like "Lucene",
and The ideal output would be "solr", "index", "full-text search", and so
on.
How to make it ? to find the related words. thx
My idea is to use FuzzyQuery, or MoreLikeThis, or calc the score with all
the terms and then sort.
Any idea ?



-
--
Email: wuqiu.m...@qq.com
--
--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-find-related-words-tp4037462.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to find related words ?

2013-01-31 Thread Jack Krupansky
Oh, so you wanted "similar" words! You should have said so... your inquiry 
said you were looking for "related" words. So, which is it? More 
specifically, what exactly are you looking for, in terms of the semantics? 
In any case, "find similar" (MoreLikeThis) is about the best you can do out 
of the box.


-- Jack Krupansky

-Original Message- 
From: Andrew Gilmartin

Sent: Thursday, January 31, 2013 9:04 AM
To: java-user@lucene.apache.org
Subject: Re: How to find related words ?

wgggfiy wrote:

en, it seems nice, but I'm puzzled by you and Andrew Gilmartin above,
what's the difference between you guys ?


The difference is that similar documents do not give you similar terms.
Similar documents can show a correlation of terms -- ie, wherever Lucene is
mentioned so is Solr and Hadoop -- but in no way does this mean that the 
terms are similar. Accumulating similar and/or synonymous terms is a manual 
process. I am sure there are text mining tools/algorithms that make 
discoveries, but I do not know about these. (I am a journeyman programmer 
not a researcher.) If anyone does know about them, please share with this 
list.


-- Andrew

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene vs Glimpse

2013-02-04 Thread Jack Krupansky
Generally, all of your example queries should work fine with Lucene, 
provided that you carefully choose your analyzer, or even use the 
StandardAnalyzer. The special characters like underscore and dot generally 
get treated as spaces and the resulting sequence of terms would match as a 
phrase. It won't be a 100% solution, but it should do reasonably well.
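
One cheap way to check that for your data is to run the candidate analyzer over a representative line of source and look at the terms it produces; a small sketch:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
TokenStream ts = analyzer.tokenStream("body",
        new StringReader("import foo.bar.baz; My_Nice_API.Some_Method"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString());   // the terms your index (and phrase queries) will actually see
}
ts.end();
ts.close();

If the terms come out the way you expect, a plain phrase query for "foo.bar.baz" or "My_Nice_API.Some_Method" run through the same analyzer will line up with what was indexed.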


Is there a query that was failing to match reasonably for you?

-- Jack Krupansky

-Original Message- 
From: Mathias Dahl

Sent: Monday, February 04, 2013 1:01 PM
To: java-user@lucene.apache.org
Subject: Lucene vs Glimpse

Hi,

I have hacked together a small web front end to the Glimpse text
indexing engine (see http://webglimpse.net/ for information). I am
very happy with how Glimpse indexes and searches data. If I understand
it correctly it uses a combination of an index and searching directly
in the files themselves as grep or other tools. The problem is that I
discovered it is not open source and now that I want to extend the use
from private to company wide I will run into license problems/costs.

So, I decided to try out Lucene. I tried the examples and changed them
a bit to use another analyzer. But when I started to think about it I
realized that I will not be able to build something like Glimpse. At
least not easily.

Why? I will try to explain:

As stated above, Glimpse uses a combination of index and in-file
search. This makes it very powerful in the sense that I can get hits
for things that are not necessarily being indexes as terms. Let's say
I have a file with this content:

...
import foo.bar.baz;
...

With Glimpse, and without telling it how to index the content I can
find the above file using a search string like "foo" or "bar" but
also, and this is important, using foo.bar.baz.

Another example:

We have a lot of PL/SQL source code, and often you can find code like this:

...
My_Nice_API.Some_Method
...

Here too, Glimpse is almost magic since it combines index and normal
search. I can find the file above using "My_Nice_API" or
"My_Nice_API.Some_Method".

In a sense I can have the cake and eat it too.

If I want to do similar "free" search stuff with Lucene I think I have
to create analyzers for the different kind of source code files, with
fields for this and that. Quite an undertaking.

Does anyone understand my point here and am I correct in that it would
be hard to implement something as "free" as with Glimpse? I am not
trying to critizise, just understand how Lucene (and Glimpse) works.

Oh, yes, Glimpse has one big drawback: it only supports search strings
up to 32 characters.

Thanks!

/Mathias

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Wildcard in a text field

2013-02-08 Thread Jack Krupansky
That description is too vague. Could you provide a couple of examples of 
index text and queries and what you expect those queries to match.


If you simply want to query for "*" and "?" in "string" fields, escape them 
with a backslash. But if you want to escape them in "text" fields, be sure 
to use an analyzer that preserves them since they generally will be treated 
as spaces.


-- Jack Krupansky

-Original Message- 
From: Nicolas Roduit

Sent: Friday, February 08, 2013 2:49 AM
To: java-user@lucene.apache.org
Subject: Wildcard in a text field

I'm looking for a way of making a query on words which contain wildcards
(* or ?). In general, we use wildcards in query, not in the text. I
haven't found anything in Lucene to build that.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Wildcard in a text field

2013-02-08 Thread Jack Krupansky
Ah, okay... some people call that "prospective" search. In any case, there 
is no direct Lucene support that I know of.


There are some references here:
http://lucene.apache.org/core/4_0_0/memory/org/apache/lucene/index/memory/MemoryIndex.html
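
A rough sketch of that "prospective" flip, where each stored tag becomes a query and the incoming search string becomes a tiny one-document in-memory index (field name assumed):

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

// the user's search string, indexed on the fly
MemoryIndex incoming = new MemoryIndex();
incoming.addField("tags", "buddhism", analyzer);

// each stored tag pattern is run as a query against it
Query storedTag = new WildcardQuery(new Term("tags", "buddh*"));
float score = incoming.search(storedTag);
if (score > 0.0f) {
    // this document's "buddh*" tag matches the incoming term "buddhism"
}

You would loop that over the stored tag patterns of each candidate document.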

-- Jack Krupansky

-Original Message- 
From: Nicolas Roduit

Sent: Friday, February 08, 2013 10:14 AM
To: java-user@lucene.apache.org
Subject: Re: Re: Wildcard in a text field

For instance, I have a list of tags related to a text. Each text with
its list of tags are put in a document and indexed by Lucene. If we
consider that a tag is "buddh*" and I would like to make a query (e.g.
"buddha" or "buddhism" or "buddhist") and find the document that contain
"buddh*".

Thanks,


Le 08. 02. 13 13:35, Jack Krupansky a écrit :
That description is too vague. Could you provide a couple of examples of 
index text and queries and what you expect those queries to match.


If you simply want to query for "*" and "?" in "string" fields, escape 
them with a backslash. But if you want to escape them in "text" fields, be 
sure to use an analyzer that preserves them since they generally will be 
treated as spaces.


-- Jack Krupansky

-Original Message- From: Nicolas Roduit
Sent: Friday, February 08, 2013 2:49 AM
To: java-user@lucene.apache.org
Subject: Wildcard in a text field

I'm looking for a way of making a query on words which contain wildcards
(* or ?). In general, we use wildcards in query, not in the text. I
haven't find anything in Lucene to build that.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: fuzzy queries

2013-02-09 Thread Jack Krupansky

You probably are not getting this document returned:

   list.add("strfffing_ m atcbbhing");

because... both terms have an edit distance greater than two.

All the other documents have one or the other or both terms with an editing 
distance of 2 or less.


Your query is essentially: Match a document if EITHER term matches. So, if 
NEITHER matches (within an editing distance of 2), the document is not a 
match.


-- Jack Krupansky

-Original Message- 
From: Pierre Antoine DuBoDeNa

Sent: Saturday, February 09, 2013 12:52 PM
To: java-user@lucene.apache.org
Subject: Re: fuzzy queries

with query like string~ matching~ (without specifying threshold) i get 14
results back..

Can it be problem with the analyzers?

Here is the code:

private File indexDir = new File("/a-directory-here");

private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
analyzer);

public static void main(String[] args) throws Exception {

IndexProfiles Indexer = new IndexProfiles();

IndexWriter w = Indexer.CreateIndex();

ArrayList<String> list = new ArrayList<String>();

list.add("string matching");

list.add("string123 matching");

list.add("string matching123");

list.add("string123 matching123");

list.add("str4ing match2ing");

list.add("1string 2matching");

list.add("str_ing ma_tching");

list.add("string_matching");

list.add("strang mutching");

list.add("strrring maatchinng");

list.add("strfffing_ m atcbbhing");

list.add("str2ing__mat3ching");

list.add("string_m atching");

list.add("string matching another token");

list.add("strasding matc4hing ano23ther tok3en");

list.add("str4ing maaatching_another 2t oken");


 for (String companyname:list)

{

Indexer.addSingleField(w, companyname);

}


int numDocs = w.numDocs();

   System.out.println("# of Docs in Index: " + numDocs);

   w.close();


   DoIndexQuery("string~ matching~");

 }

public static void DoIndexQuery(String query) throws IOException,
ParseException {

IndexProfiles Indexer = new IndexProfiles();

   IndexReader reader = Indexer.LoadIndex();


Indexer.SearchIndex(reader, query, 50);



   reader.close();

}


public IndexWriter CreateIndex() throws IOException {



Directory index = FSDirectory.open(indexDir);

IndexWriter w = new IndexWriter(index, config);

return w;



}


public HashMap SearchIndex(IndexReader w, String query, int topk)
throws IOException, ParseException {



 Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer
).parse(query);



IndexSearcher searcher = new IndexSearcher(w);

TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);

searcher.search(q, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;



System.out.println("Found " + hits.length + " hits.");

HashMap map = new HashMap();

for(int i=0;i


Can you reduce your test case to indexing one document/field and
running a single FuzzyQuery (you seem to be running two at once,
OR'ing the results)?

And show the complete standalone source code (eg what is topk?) so we
can see how you are indexing / building the Query / searching.

The default minSim is 0.5.

Note that 0.01 is not useful in practice: it (should) match nearly all
terms.  But I agree it's odd one term is not matching.

Mike McCandless

http://blog.mikemccandless.com

On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa
 wrote:
>>
>> Hello,
>>
>> I use lucene 3.6 and i try to use fuzzy queries so that I can match 
>> much

>> more results.
>>
>> I am adding for example these strings:
>>
>>  list.add("string matching");
>>
>> list.add("string123 matching");
>>
>> list.add("string matching123");
>>
>> list.add("string123 matching123");
>>
>> list.add("str4ing match2ing");
>>
>> list.add("1string 2matching");
>>
>> list.add("str_ing ma_tching");
>>
>> list.add("string_matching");
>>
>> list.add("strang mutching");
>>
>> list.add("strrring maatchinng");
>>
>> list.add("strfffing_ m atcbbhing");
>>
>> list.add("str2ing__mat3ching");
>>
>> list.add("string_m atching");
>>
>> list.add("string matching another token");
>>
>> list.add("strasding matc4hing ano23ther tok3en");
>>
>> list.add("str4ing maaatching_another 2t oken");
>>
>>
>>
>> then i do a query:
>>
>>
>> "string~0.01 matching~0.01"
>&

Re: Grouping and tokens

2013-02-18 Thread Jack Krupansky
Please clarify exactly what you want to group by - give a specific example 
that makes it clear what terms should affect grouping and which shouldn't.


-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens

Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have a use case where the user has the
freedom to group by any field during search time.

Now that only untokenized fields are eligible for grouping, this is
creating an issue with my search. Say for instance the book "*Fifty shades
of grey*" when tokenized and searched for "*shades*" turns up in the
result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer-Version4.1).

How do I go about this? Is indexing a tokenized and non-tokenized
version of the same field the only go? I am afraid its way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping and tokens

2013-02-18 Thread Jack Krupansky
Okay, so, fields that would normally need to be tokenized must be stored as 
both raw strings for grouping and tokenized text for keyword search. Simply 
use copyField to copy from one to the other.


-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Monday, February 18, 2013 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky 
wrote:



Please clarify exactly what you want to group by - give a specific example
that makes it clear what terms should affect grouping and which shouldn't.



Assume I am indexing a library data. Say there are the following fields for
a particular book.
1. Published
2. Language
3. Genre
4. Author
5. Title
6. ISBN

    At search time, the user can ask to group by any of the above
fields, which means all of them are not supposed to be tokenized. So as I
had told earlier, there is a book titled "Fifty shades of gray" and the
user searches for "shades". The result turns up in case the field is
tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear?

In a nutshell, how do I use a groupby on a field that is also
tokenized?



-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens


Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have an use case where the user has the
freedom to group by any field during search time.

Now that only tokenized fields are eligible for grouping, this is
creating an issue with my search. Say for instance the book "*Fifty shades
of grey*" when tokenized and searched for "*shades*" turns up in the

result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer-Version4.1).

How do I go about this? Is indexing a tokenized and non-tokenized
version of the same field the only go? I am afraid its way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420

--**--**-
To unsubscribe, e-mail: 
java-user-unsubscribe@lucene.**apache.org
For additional commands, e-mail: 
java-user-help@lucene.apache.**org






--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India.
+91 9626975420 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping and tokens

2013-02-18 Thread Jack Krupansky
Oops, sorry for the "Solr" answer. In Lucene you need to simply index the 
same value, once as a raw string and a second time as a tokenized text 
field. Grouping would use the raw string version of the data.
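
A sketch of what that looks like at index time and at grouping time (field names are just examples; GroupingSearch lives in the lucene-grouping module and "searcher" is an open IndexSearcher):

// index time: the same title goes in twice
Document doc = new Document();
doc.add(new TextField("title", "Fifty Shades of Grey", Field.Store.YES));          // tokenized, for keyword search
doc.add(new StringField("title_group", "Fifty Shades of Grey", Field.Store.NO));   // raw string, for grouping

// search time: match on the tokenized field, group on the raw one
Query q = new TermQuery(new Term("title", "shades"));
GroupingSearch grouping = new GroupingSearch("title_group");
TopGroups<BytesRef> groups = grouping.search(searcher, q, 0, 10);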


-- Jack Krupansky

-Original Message----- 
From: Jack Krupansky

Sent: Monday, February 18, 2013 11:21 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

Okay, so, fields that would normally need to be tokenized must be stored as
both raw strings for grouping and tokenized text for keyword search. Simply
use copyField to copy from one to the other.

-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Monday, February 18, 2013 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky
wrote:


Please clarify exactly what you want to group by - give a specific example
that makes it clear what terms should affect grouping and which shouldn't.



Assume I am indexing a library data. Say there are the following fields for
a particular book.
1. Published
2. Language
3. Genre
4. Author
5. Title
6. ISBN

While search time, the user can ask to group by any of the above
fields, which means all of them are not supposed to be tokenized. So as I
had told earlier, there is a book titled "Fifty shades of gray" and the
user searches for "shades". The result turns up in case the field is
tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear?

In a nutshell, how do I use a groupby on a field that is also
tokenized?



-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens


Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have an use case where the user has the
freedom to group by any field during search time.

Now that only tokenized fields are eligible for grouping, this is
creating an issue with my search. Say for instance the book "*Fifty shades
of grey*" when tokenized and searched for "*shades*" turns up in the

result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer-Version4.1).

How do I go about this? Is indexing a tokenized and non-tokenized
version of the same field the only go? I am afraid its way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India.
+91 9626975420


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping and tokens

2013-02-19 Thread Jack Krupansky
Well, you don't need to "store" both copies since they will be the same. 
They both need to be "indexed" (string form for grouping, text form for 
keyword search), but only one needs to be "stored".


-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Tuesday, February 19, 2013 1:07 AM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Tue, Feb 19, 2013 at 12:57 PM, Jack Krupansky 
wrote:



Oops, sorry for the "Solr" answer. In Lucene you need to simply index the
same value, once as a raw string and a second time as a tokenized text
field. Grouping would use the raw string version of the data.

Yeah, thanks Jack. Was just wondering if there would be a better alternative
to 2x storing. But I don't see any. Thanks again.


-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Monday, February 18, 2013 11:21 PM

To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

Okay, so, fields that would normally need to be tokenized must be stored 
as
both raw strings for grouping and tokenized text for keyword search. 
Simply

use copyField to copy from one to the other.

-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky
wrote:

 Please clarify exactly what you want to group by - give a specific 
example
that makes it clear what terms should affect grouping and which 
shouldn't.



Assume I am indexing a library data. Say there are the following fields 
for

a particular book.
1. Published
2. Language
3. Genre
4. Author
5. Title
6. ISBN

While search time, the user can ask to group by any of the above
fields, which means all of them are not supposed to be tokenized. So as I
had told earlier, there is a book titled "Fifty shades of gray" and the
user searches for "shades". The result turns up in case the field is
tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear?

    In a nutshell, how do I use a groupby on a field that is also
tokenized?



-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens


Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have an use case where the user has 
the

freedom to group by any field during search time.

Now that only tokenized fields are eligible for grouping, this is
creating an issue with my search. Say for instance the book "*Fifty 
shades

of grey*" when tokenized and searched for "*shades*" turns up in the

result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer-Version4.1).

How do I go about this? Is indexing a tokenized and non-tokenized
version of the same field the only go? I am afraid its way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India.
+91 9626975420


-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: possible bug on Spellchecker

2013-02-20 Thread Jack Krupansky

Any reason that you are not using the DirectSpellChecker?

See:
http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html
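
For reference, a minimal sketch of using it (assuming an open IndexReader 
called "reader" and an indexed field named "contents"; "geroge" is just the 
example misspelling):

  DirectSpellChecker spellChecker = new DirectSpellChecker();
  // Suggestions come straight from the terms of the main index,
  // so there is no separate n-gram spell index to build or keep in sync.
  SuggestWord[] suggestions = spellChecker.suggestSimilar(
      new Term("contents", "geroge"), 5, reader,
      SuggestMode.SUGGEST_MORE_POPULAR);
  for (SuggestWord suggestion : suggestions) {
      System.out.println(suggestion.string + " (docFreq=" + suggestion.freq + ")");
  }

Since it works directly on the indexed terms, the n-gram min/max issue you 
describe shouldn't arise there.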

-- Jack Krupansky

-Original Message- 
From: Samuel García Martínez

Sent: Wednesday, February 20, 2013 3:34 PM
To: java-user@lucene.apache.org
Subject: possible bug on Spellchecker

Hi all,

Debugging the Solr spellchecker (IndexBasedSpellchecker, delegating to the Lucene
SpellChecker) behaviour, I think I found a bug when the input is a 6-letter
word:
 - george
 - anthem
 - argued
 - fluent

Due to the getMin() and getMax() the grams indexed for these terms are 3
and 4. So, the fields would be something like this:
 - for "*george*"
- start3: "geo"
- start4: "geor"
- end3: "rge"
- end4: "orge"
- 3: "geo", "eor", "org", "rge"
- 4: "geor", "eorg", "orge"
 - for "*anthem*"
- start3: "ant"
- start4: "anth"
- end3: "tem"
- end4: "them"

The problem shows up when the user swaps the 3rd and 4th characters, misspelling
the word like this:
 - geroge
 - anhtem

The queries generated for this terms are: (SHOULD boolean queries)
- for "*geroge*"
 - start3: "ger"
 - start4: "gero"
 - end3: "oge"
 - end4: "roge"
 - 3: "ger", "ero", "rog", "oge"
 - 4: "gero", "erog", "roge"
- for "*anhtem*"
 - start3: "anh"
 - start4: "anht"
 - end3: "tem"
 - end4: "htem"
 - 3: "anh", "nht", "hte", "tem"
 - 4: "anht", "nhte", "htem"

So, as you can see, this kind of misspelling never matches the suitable
suggestions although the edit distance is 0.9556.

I think getMin(int l) and getMax(int l) should return 2 and 3,
respectively, for l==6. Debugging other values, I did not find any problem
with any kind of misspelling.

Any thoughts about this?

--
Un saludo,
Samuel García 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boolean Query not working in Lucene 4.0

2013-02-26 Thread Jack Krupansky
Try detailing both your expected behavior and the actual behavior. Try 
providing an actual code snippet and actual index and query data. Is it 
failing for all types and titles or just for some?


-- Jack Krupansky

-Original Message- 
From: saisantoshi

Sent: Tuesday, February 26, 2013 6:52 PM
To: java-user@lucene.apache.org
Subject: Boolean Query not working in Lucene 4.0

The following query does not seem to work after we upgraded from 2.4 to 4.0:

*+type:sometype +title:sometitle**

Any ideas as to some of the places to look? Is the above query syntactically
correct? I would appreciate it if you could advise on the above.

Thanks,
Sai.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boolean-Query-not-working-in-Lucene-4-0-tp4043246.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting documents from suggestions

2013-03-14 Thread Jack Krupansky
Could you give us some examples of what you expect? I mean, how is your 
suggested set of documents any different from simply executing a query with 
the list of suggested terms (using q.op=OR)?


Or, maybe you want something like MoreLikeThis?

-- Jack Krupansky

-Original Message- 
From: Bratislav Stojanovic

Sent: Thursday, March 14, 2013 5:36 PM
To: java-user@lucene.apache.org
Subject: Getting documents from suggestions

Hi all,

How can I filter suggestions based on some value from the indexed field?
I have a stored 'id' field in my index and I want to use that to examine
documents
where the suggestion was found, but how do I get a Document from a suggestion?
The SpellChecker class only returns an array of strings.

What classes should I use? Please help.

Thanx in advance.

--
Bratislav Stojanovic, M.Sc. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting documents from suggestions

2013-03-14 Thread Jack Krupansky

Let's refine this...

If a top suggestion is X, do you simply want to know a few of the documents 
which have the highest term frequency for X?


Or is there some other term-oriented metric you might propose?

-- Jack Krupansky

-Original Message- 
From: Bratislav Stojanovic

Sent: Thursday, March 14, 2013 6:14 PM
To: java-user@lucene.apache.org
Subject: Re: Getting documents from suggestions

Wow that was fast :)

I have implemented a simple search box with auto-suggestions, so whenever the
user types in something, an Ajax call is fired to the SuggestServlet and in
return 10 suggestions are shown. It's working fine with the SpellChecker
class, but I only get an array of Strings.

What I want is to get Lucene Document instances so I can use doc.get("id")
to filter those suggestions. This field is my own field; it has nothing to do
with the internal document id Lucene generates.

Here's an example : when I type "apache" I get suggestions like 
"apache.org",

"apache2" etc.
Now I want to have something like this :
Document doc = SomeClass.getDocFromSuggestion("apache.org");
if (doc.get("id") == ...) {
  //add suggestion into the result
} else {
  //do nothing.
}

Is MoreLikeThis designed for this?

On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky 
wrote:



Could you give us some examples of what you expect? I mean, how is your
suggested set of documents any different from simply executing a query 
with

the list of suggested terms (using q.op=OR)?

Or, maybe you want something like MoreLikeThis?

-- Jack Krupansky

-Original Message- From: Bratislav Stojanovic
Sent: Thursday, March 14, 2013 5:36 PM
To: java-user@lucene.apache.org
Subject: Getting documents from suggestions


Hi all,

How can I filter suggestions based on some value from the indexed field?
I have a stored 'id' field in my index and I want to use that to examine
documents
where the suggestion was found, but how to get Document from suggestion?
SpellChecker class only returns array of strings.

What classes should I use? Please help.

Thanx in advance.

--
Bratislav Stojanovic, M.Sc.

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
Bratislav Stojanovic, M.Sc. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting documents from suggestions

2013-03-16 Thread Jack Krupansky
I don't have time to debug your code right now, but make sure that 
the analysis is consistent between index and query. For example, "Apache" 
vs. "apache".


-- Jack Krupansky

-Original Message- 
From: Bratislav Stojanovic

Sent: Saturday, March 16, 2013 7:29 AM
To: java-user@lucene.apache.org
Subject: Re: Getting documents from suggestions

Hey Jack,

I've tried MoreLikeThis, but it always returns 0 hits. Here's the code,
it's very simple:

// test2
Index lucene = null;
try {
    lucene = new Index();
    MoreLikeThis mlt = new MoreLikeThis(lucene.reader);
    mlt.setAnalyzer(lucene.analyzer);
    Reader target = new StringReader("apache");
    Query query = mlt.like(target, "contents");
    TopDocs results = lucene.searcher.search(query, 10);
    ScoreDoc[] hits = results.scoreDocs;
    System.out.println("Total " + hits.length + " hits");
    for (int i = 0; i < hits.length; i++) {
        Document doc = lucene.searcher.doc(hits[i].doc);
        System.out.println("Hit " + i + " : " + doc.getField("id").stringValue());
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (lucene != null) lucene.close();
}
}

Here are my fields in the index :

...
Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
doc.add(pathField);

doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
// id so we can fetch metadata
doc.add(new LongField("id", id, Field.Store.YES));
// add default user (guest)
doc.add(new LongField("userid", -1L, Field.Store.YES));

doc.add(new TextField("contents", new BufferedReader(new
InputStreamReader(fis, "UTF-8";
...

Do you have any clue what might be wrong? Btw, searching for "apache"
returns me 36 hits, so index is fine.

P.S. I even tried like(int) and passing Doc. Id. but same thing.

On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky 
wrote:



Could you give us some examples of what you expect? I mean, how is your
suggested set of documents any different from simply executing a query 
with

the list of suggested terms (using q.op=OR)?

Or, maybe you want something like MoreLikeThis?

-- Jack Krupansky

-Original Message- From: Bratislav Stojanovic
Sent: Thursday, March 14, 2013 5:36 PM
To: java-user@lucene.apache.org
Subject: Getting documents from suggestions


Hi all,

How can I filter suggestions based on some value from the indexed field?
I have a stored 'id' field in my index and I want to use that to examine
documents
where the suggestion was found, but how to get Document from suggestion?
SpellChecker class only returns array of strings.

What classes should I use? Please help.

Thanx in advance.

--
Bratislav Stojanovic, M.Sc.

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
Bratislav Stojanovic, M.Sc. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multi-value fields in Lucene 4.1

2013-03-22 Thread Jack Krupansky
I don't think there is a way of identifying which of the values of a 
multivalued field matched. But... I haven't checked the code to be 
absolutely certain whether there isn't some expert way.


Also, realize that multiple values could match, such as if you queried for 
"B*".


-- Jack Krupansky

-Original Message- 
From: Chris Bamford

Sent: Friday, March 22, 2013 5:57 AM
To: java-user@lucene.apache.org
Subject: Multi-value fields in Lucene 4.1

Hi,

If I index several similar values in a multivalued field (e.g. many authors 
to one book), is there any way to know which of these matched during a 
query?

e.g.

 Book "The art of Stuff", with authors "Bob Thingummy" and "Belinda 
Bootstrap"


If we queried for +(author:Be*) and matched this document, is there a way of 
drilling down and identifying the specific sub-field that actually triggered 
the match ("Belinda Bootstrap") ?  I was wondering what the lowest 
granularity of matching actually is - document / field / sub-field ...


I am happy to index with term vectors and positions if it helps.

Thanks,

- Chris 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive analyzer

2013-03-22 Thread Jack Krupansky

Try the ASCII Folding Filter:
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

-- Jack Krupansky

-Original Message- 
From: Jerome Blouin

Sent: Friday, March 22, 2013 12:22 PM
To: java-user@lucene.apache.org
Subject: Accent insensitive analyzer

Hello,

I'm looking for an analyzer that allows performing accent-insensitive search 
in Latin languages. I'm currently using the StandardAnalyzer but it doesn't 
fulfill this need. Could you please point me to the one I need to use? I've 
checked the javadoc for the various analyzer packages but can't find one. Do 
I need to implement my own analyzer?


Regards,
Jerome


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive analyzer

2013-03-22 Thread Jack Krupansky

Start with the Standard Tokenizer:
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
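
Putting the two together, here is a minimal sketch of an accent-folding 
analyzer (assuming Lucene 4.2; the class name is my own):

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.util.Version;

  public final class AccentFoldingAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
          Tokenizer source = new StandardTokenizer(Version.LUCENE_42, reader);
          TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
          result = new ASCIIFoldingFilter(result); // "Gómez", "Gômez" -> "gomez"
          return new TokenStreamComponents(source, result);
      }
  }

Use the same analyzer at index time and query time so both sides fold accents 
the same way.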

-- Jack Krupansky

-Original Message- 
From: Jerome Blouin

Sent: Friday, March 22, 2013 12:53 PM
To: java-user@lucene.apache.org
Subject: RE: Accent insensitive analyzer

I understand that I can't configure it on an analyzer, so which class can 
I apply it to?


Thank,
Jerome

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, March 22, 2013 12:38 PM
To: java-user@lucene.apache.org
Subject: Re: Accent insensitive analyzer

Try the ASCII Folding FIlter:
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

-- Jack Krupansky

-Original Message-
From: Jerome Blouin
Sent: Friday, March 22, 2013 12:22 PM
To: java-user@lucene.apache.org
Subject: Accent insensitive analyzer

Hello,

I'm looking for an analyzer that allows performing accent insensitive search 
in latin languages. I'm currently using the StandardAnalyzer but it doesn't 
fulfill this need. Could you please point me to the one I need to use? I've 
checked the javadoc for the various analyzer packages but can't find one. Do 
I need to implement my own analyzer?


Regards,
Jerome


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing a long list

2013-03-31 Thread Jack Krupansky
The first question is how do you want to access the data? What do you want 
your queries to look like?


What is the larger context? Are these properties of larger documents? Are 
there more than one per document? Etc.


Why not just store the property as a tokenized field? Then you can query 
whether v(i) or v(j) are or are not present as keywords.
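
A minimal sketch of that approach (the field and variable names are just 
examples):

  // Index the whole list as one tokenized, stored field; each value becomes a term.
  Document doc = new Document();
  doc.add(new TextField("someProperty", "v1 v2 v3 v100", Field.Store.YES));

  // At query time, membership of a single value is just a TermQuery:
  Query hasV2 = new TermQuery(new Term("someProperty", "v2"));

Because the field is stored, you can also get the original list back from the 
retrieved document with doc.get("someProperty").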


-- Jack Krupansky

-Original Message- 
From: Paul Bell

Sent: Sunday, March 31, 2013 8:21 AM
To: java-user@lucene.apache.org
Subject: Indexing a long list

Hi All,

Suppose I need to index a property whose value is a long list of terms. For
example,

   someProperty = ["v1", "v2",  , "v100"]

Please note that I could drop the leading "v" and index these as numbers
instead of strings.

But the question is what's the best practice in Lucene when dealing with a
case like this? I need to be able to retrieve the list. This makes me think
that I need to store it. And I suppose that the list could be stored in the
index itself or in the "content" to which the index points.

So there are really two parts to this question:

1. Lucene "best practices" for long list
2. Where to store such a list

Thanks for your help.

-Paul 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing a long list

2013-03-31 Thread Jack Krupansky

Multivalued fields are the other approach to keyword value pairs.

And if you can denormalize your data, storing structure as separate 
documents can make sense and support more powerful queries. Although the 
join capabilities are rather limited.
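
A minimal sketch of the multivalued approach (field names carried over from 
your example; adding the same field name repeatedly is what makes it 
multivalued in Lucene):

  Document node = new Document();
  node.add(new TextField("name", "vol1", Field.Store.YES));
  // One value per out-edge; Lucene keeps all of them under the same field name.
  node.add(new TextField("outEdges.name", "hasMirror-1", Field.Store.YES));
  node.add(new TextField("outEdges.name", "hasMirror-2", Field.Store.YES));

  // Matches if any of the values starts with "hasmirror" (lowercased here
  // because prefix queries bypass the analyzer):
  Query startsWith = new PrefixQuery(new Term("outEdges.name", "hasmirror"));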


-- Jack Krupansky

-Original Message- 
From: Paul Bell

Sent: Sunday, March 31, 2013 8:52 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing a long list

Hi Jack,

Thanks for the reply. I am very new to Lucene.

Your timing is a bit uncanny. I was just coming to the conclusion that
there's nothing special about this case for Lucene, i.e., a tokenized field
should work, when I looked up and saw your e-mail.

In re the larger context: yeah, the properties in question here belong to
some kind of node, e.g., maybe a vertex in a graph DB. Possible properties
include 'name', 'type', 'inEdges', 'outEdges', etc. Most properties are
simple k=v pairs. But a few, notably the 'edge' properties, could be long
lists.

My intent was to create a Lucene Document for each node. The Fields in this
Document would represent all of the node's properties. A generic (not in
Lucene syntax) query should be able to ask after any property, e.g.,

   ('name' equals "vol1" AND 'outEdges.name' startsWith "hasMirror")

Note that 'outEdges.name' represents multiple elements, where 'name'
represents only one. That is, the generic query syntax is trying to match
any out-edge whose name property starts with "hasMirror". I haven't quite
crystallized the generic query syntax and don't know how best to map it to
both a Lucene query and to an appropriate Lucene index structure. Please
let me know if you've any suggestions!

Thanks again.

-Paul



On Sun, Mar 31, 2013 at 8:33 AM, Jack Krupansky 
wrote:



The first question is how do you want to access the data? What do you want
your queries to look like?

What is the larger context? Are these properties of larger documents? Are
there more than one per document? Etc.

Why not just store the property as a tokenized field? Then you can query
whether v(i) or v(j) are or are not present as keywords.

-- Jack Krupansky

-Original Message- From: Paul Bell
Sent: Sunday, March 31, 2013 8:21 AM
To: java-user@lucene.apache.org
Subject: Indexing a long list


Hi All,

Suppose I need to index a property whose value is a long list of terms. 
For

example,

   someProperty = ["v1", "v2",  , "v100"]

Please note that I could drop the leading "v" and index these as numbers
instead of strings.

But the question is what's the best practice in Lucene when dealing with a
case like this? I need to be able to retrieve the list. This makes methink
that I need to store it. And I suppose that the list could be stored in 
the

index itself or in the "content" to which the index points.

So there are really two parts to this question:

1. Lucene "best practices" for long list
2. Where to store such a list

Thanks for your help.

-Paul

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MLT Using a Query created in a different index

2013-04-04 Thread Jack Krupansky
The heart of MLT is examining the top result of a query (or maybe more than 
one) and identifying the "top" terms from the top document(s) and then 
simply using those top terms for a subsequent query. The term ranking would 
of course depend on term frequency, and other relevancy considerations - for 
the corpus of the original query. A rich query corpus will give great 
results, a weak corpus will give weak results - no matter how rich or weak 
the final target corpus is. OTOH, if the target corpus really is 
representative of the source corpus, then results should be either good or 
terrible - the selected/query document may not have any representation in 
the target corpus.


-- Jack Krupansky

-Original Message- 
From: Peter Lavin

Sent: Thursday, April 04, 2013 1:06 PM
To: java-user@lucene.apache.org
Subject: MLT Using a Query created in a different index


Dear Users,

I am doing some research where Lucene is integrated into agent
technology. Part of this work involves using an MLT query in an index
which was not created from a document in that index (i.e. the query is
created, serialised and sent to the remote agent).

Can anyone point me towards any information on what the potential impact
of doing this would be?

I'm assuming if both indexes have similar sets of documents, the impact
would be negligible, but what, for example, would be the impact of
creating an MLT query from an index with only one or two documents for
use in an index with several (say 100+) documents?

with thanks,
Peter

--
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MLT Using a Query created in a different index

2013-04-05 Thread Jack Krupansky
In a statistical sense, for the majority of documents, yes, but you could 
probably find quite a few outlier examples where the results from A to B or 
from B to A are significantly or even completely different, or even 
non-existent.


-- Jack Krupansky

-Original Message- 
From: Peter Lavin

Sent: Friday, April 05, 2013 3:49 AM
To: java-user@lucene.apache.org
Subject: Re: MLT Using a Query created in a different index


Thanks for that Jack,

so it's fair to say that if both the source and target corpora are large
and diverse, then the impact of using a different index to create the
query would be negligible.

P.

On 04/04/2013 06:49 PM, Jack Krupansky wrote:

The heart of MLT is examining the top result of a query (or maybe more
than one) and identifying the "top" terms from the top document(s) and
then simply using those top terms for a subsequent query. The term
ranking would of course depend on term frequency, and other relevancy
considerations - for the corpus of the original query. A rich query
corpus will give great results, a weak corpus will give weak results -
no matter how rich or weak the final target corpus is. OTOH, if the
target corpus really is representative on the source corpus, then
results should be either good or terrible - the selected/query document
may not have any representation in the target corpus.

-- Jack Krupansky

-Original Message- From: Peter Lavin
Sent: Thursday, April 04, 2013 1:06 PM
To: java-user@lucene.apache.org
Subject: MLT Using a Query created in a different index


Dear Users,

I am doing some research where Lucene is integrated into agent
technology. Part of this work involves using an MLT query in an index
which was not created from a document in that index (i.e. the query is
created, serialised and sent to the remote agent).

Can anyone point me towards any information on what the potential impact
of doing this would be?

I'm assuming if both indexes have similar sets of documents, the impact
would be negligible, but what, for example would be the impact of
creating an MLT query from an index with only one or two documents for
use in an index with several (say 100+) documents,

with thanks,
Peter



--
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to consider special character like + ! @ # in Lucene search

2013-04-10 Thread Jack Krupansky
You'll have to switch from using the standard analyzer/tokenizer to using 
the whitespace analyzer/tokenizer, and make sure not to use any additional 
token filters that might eliminate some or all special characters (or 
provide character maps for the ones that do accept character maps.) You will 
need to completely reindex your data as well. And, in some cases you may 
need to escape some special characters with backslash in your queries.
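
A minimal sketch of the query side (written against the 4.x API here, so 
adjust the constructors for 2.9; "contents" is just an example field name):

  Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_42);
  QueryParser parser = new QueryParser(Version.LUCENE_42, "contents", analyzer);
  // escape() protects query-syntax characters; the whitespace analyzer then
  // keeps tokens like "a/b" or "c++" intact instead of stripping them.
  Query q = parser.parse(QueryParser.escape("a/b c++"));

The same whitespace-based analyzer has to be used at index time as well, 
otherwise the special characters will already have been stripped from the 
index.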


-- Jack Krupansky

-Original Message- 
From: neeraj shah

Sent: Wednesday, April 10, 2013 2:07 AM
To: java-user@lucene.apache.org
Subject: how to consider special character like + ! @ # in Lucene search

Hello, I'm using Lucene 2.9.


I have to search for special characters like "/" in the given text, but when I'm
searching it gives me 0 hits.
I have tried QueryParser.escape("/") but did not get the result.
How do I proceed further? Please help.

--
With Regards,
Neeraj Kumar Shah 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to index Sharepoint files with Lucene

2013-04-10 Thread Jack Krupansky
The Apache ManifoldCF "connector framework" has a SharePoint connector that 
can crawl SharePoint repositories. It has an output connector that feeds 
into Solr/SolrCell, but you can easily develop a connector that outputs 
whatever you want - like put the crawled files into a file system directory, 
or maybe even send each file directly into Tika and then directly index the 
content into Lucene, if that's what you want. In any case, MCF handles the 
SharePoint access and crawling.


See:
http://manifoldcf.apache.org/en_US/index.html

-- Jack Krupansky

-Original Message- 
From: Álvaro Vargas Quezada

Sent: Wednesday, April 10, 2013 5:31 PM
To: java-user@lucene.apache.org
Subject: How to index Sharepoint files with Lucene

Hi everyone!
I'm trying to combine Lucene with Sharepoint (we use Windows and SP 2010), 
but I couldn't find good tutorials or proven test cases that demonstrate 
this integration. Do you know of any tutorial, or can you give me some help with 
this?
I have read all of "Lucene in Action", but it just talks about indexing 
files, not integration with other software.
Thanks in advance. Greetz from Chile 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

2013-04-15 Thread Jack Krupansky
I didn't read your code, but do you have the "reset" that is now mandatory 
and throws AIOOBE if not present?
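
For reference, a minimal sketch of the consume loop with the calls the 
TokenStream contract requires (reset before the first incrementToken, then 
end and close when done):

  TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_42,
      new StringReader("this is a test"));
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();                       // mandatory before consuming
  while (ts.incrementToken()) {
      System.out.println(term.toString());
  }
  ts.end();
  ts.close();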


-- Jack Krupansky

-Original Message- 
From: andi rexha

Sent: Monday, April 15, 2013 10:21 AM
To: java-user@lucene.apache.org
Subject: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

Hi,
I have tried to get all the tokens from a TokenStream in the same way as I 
was doing in the 3.x version of Lucene, but now (at least with 
WhitespaceTokenizer) I get an exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
   at java.lang.Character.codePointAtImpl(Character.java:2405)
   at java.lang.Character.codePointAt(Character.java:2369)
   at 
org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
   at 
org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)




The code is quite simple, and I thought that it could have worked, but 
obviously it doesn't (unless I have made some mistakes).


Here is the code, in case you spot some bugs on it (although it is trivial):
String str = "this is a test";
   Reader reader = new StringReader(str);
   TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42, 
reader);  //tokenStreamAnalyzer.tokenStream("test", reader);
   CharTermAttribute attribute = 
tokenStream.getAttribute(CharTermAttribute.class);

   while (tokenStream.incrementToken()) {
   System.out.println(new String(attribute.buffer(), 0, 
attribute.length()));

   }

Hope you have any idea of why it is happening.
Regards,
Andi



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

2013-04-15 Thread Jack Krupansky
Yes, reset was always "mandatory" from an API contract sense, but not always 
enforced in a practical sense in 3.x (no uniformly extreme negative 
consequences), as the original emailer indicated. Now, it is "mandatory" in 
a practical sense as well (extremely annoying consequences in all cases of a 
contract violation). So, I should have said that the contract was mandatory 
but not enforced... which from a practical perspective negates its mandatory 
contractual value.


-- Jack Krupansky

-Original Message- 
From: Uwe Schindler

Sent: Monday, April 15, 2013 11:53 AM
To: java-user@lucene.apache.org
Subject: RE: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

Hi,

It was always mandatory! In Lucene 2.x/3.x some Tokenizers just returned 
bogus, undefined stuff if not correctly reset before usage, especially when 
Tokenizers are "reused" by the Analyzer, which is now mandatory in 4.x. So 
we made it throw some Exception (NPE or AIOOBE) in Lucene 4 by initializing 
the state fields in Lucene 4.0 with some default values that cause the 
Exception. The exception is not made more specific, for performance 
reasons (it's just caused by the new default values set in the ctor).


-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-----Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, April 15, 2013 4:25 PM
To: java-user@lucene.apache.org
Subject: Re: WhitespaceTokenizer, incrementToke()
ArrayOutOfBoundException

I didn't read your code, but do you have the "reset" that is now mandatory
and throws AIOOBE if not present?

-- Jack Krupansky

-Original Message-
From: andi rexha
Sent: Monday, April 15, 2013 10:21 AM
To: java-user@lucene.apache.org
Subject: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

Hi,
I have tryed to get all the tokens from a TokenStream in the same way as I
was doing in the 3.x version of Lucene, but now (at least with
WhitespaceTokenizer) I get an exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
at java.lang.Character.codePointAtImpl(Character.java:2405)
at java.lang.Character.codePointAt(Character.java:2369)
at
org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePoint
At(CharacterUtils.java:164)
at
org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokeniz
er.java:166)



The code is quite simple, and I thought that it could have worked, but
obviously it doesn't (unless I have made some mistakes).

Here is the code, in case you spot some bugs on it (although it is 
trivial):

String str = "this is a test";
Reader reader = new StringReader(str);
TokenStream tokenStream = new
WhitespaceTokenizer(Version.LUCENE_42,
reader);  //tokenStreamAnalyzer.tokenStream("test", reader);
CharTermAttribute attribute =
tokenStream.getAttribute(CharTermAttribute.class);
while (tokenStream.incrementToken()) {
System.out.println(new String(attribute.buffer(), 0,
attribute.length()));
}

Hope you have any idea of why it is happening.
Regards,
Andi



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multiple PositionIncrement attributes

2013-04-25 Thread Jack Krupansky

You can use SpanNearQuery to seek matches within a specified distance.

Lucene knows nothing about "sentences". But if you have an analyzer or 
custom code that artificially bumps the position to the next multiple of 
some number like 100 or 1000 when a sentence boundary pattern is 
encountered, you could use that number times n to match within n sentences, 
roughly, plus or minus a sentence or two - there is nothing to cause the 
nearness to be rounded or truncated exactly to one of those boundaries.


Maybe you want two numbers: 1) sentence separation, say 1000, and 2) maximum 
sentence length, say 500. The SpanNearQuery would use n-1 times the sentence 
separation plus the maximum sentence length. Well, you have to adjust that 
for how you count sentences - is 1 the current sentence or is that 0?
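
A minimal sketch for "Putin within 3 sentences after Obama", assuming a 
sentence separation of 1000 and a maximum sentence length of 500 (both 
numbers are yours to tune):

  int sentences = 3;
  int slop = (sentences - 1) * 1000 + 500;
  SpanQuery obama = new SpanTermQuery(new Term("contents", "obama"));
  SpanQuery putin = new SpanTermQuery(new Term("contents", "putin"));
  // inOrder = true: "putin" must occur after "obama", within the slop.
  SpanNearQuery query = new SpanNearQuery(new SpanQuery[] { obama, putin }, slop, true);

For the plain word-distance case you simply use the word slop (e.g. 3) with 
inOrder = false.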


-- Jack Krupansky

-Original Message- 
From: Igor Shalyminov

Sent: Thursday, April 25, 2013 6:54 AM
To: java-user@lucene.apache.org
Subject: Multiple PositionIncrement attributes

Hi all!

I use PositionIncrement attribute for finding words at some distance from 
each other. And I have two problems with that:
1) I want to search words within one sentence. A possible solution would be 
to set PositionIncrement of +INF (like 30 :) ) to the sentence break tag.
2) I want to use in my search both word-distance and sentence-distance 
between words (e.g. find the word "Putin" within 3 sentences after the word 
"Obama" or find the words "cheese" and "bacon" in one sentence within 3 
words of each other).


For the 2nd problem, is there a way of storing multiple position information 
sources in the index and using them for searching? Say, at least choosing 
one of those for a query.



--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

2013-05-06 Thread Jack Krupansky
Do you mean the raw character offsets of the starting and ending characters 
of the terms?


No.

Although, if you index the text as a raw string, you might be able to come 
up with a regex query like "jakarta.{1,10}apache"
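
Something like this, as a sketch (assuming the raw, un-analyzed string is 
indexed in its own field, here called "raw"; note that Lucene regexps must 
match the whole term, hence the leading and trailing ".*"):

  Query q = new RegexpQuery(new Term("raw", ".*jakarta.{1,10}apache.*"));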


-- Jack Krupansky

-Original Message- 
From: wgggfiy

Sent: Monday, May 06, 2013 11:39 PM
To: java-user@lucene.apache.org
Subject: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

As I know, the syntax *"jakarta apache"~10* is a PhraseQuery with a
slop of 10 in positions, but what I want is *based on offset*, not on position.
Can anyone help me? thx.



-
--
Email: wuqiu.m...@qq.com
--
--
View this message in context: 
http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

2013-05-13 Thread Jack Krupansky

You'll have to be more explicit about the actual data and what didn't work.

Try developing a simple, self-contained unit test with some simple strings 
as input that demonstrates the case that you say doesn't work.


I mean, regular expressions and field analysis can both be quite tricky - 
even a tiny typo can break everything.


To be clear, my suggested approach to using regular expressions works only 
on un-tokenized input, so there won't be any positions or even offsets.


Other than that, you're on your own until you develop that small, 
self-contained unit test.


Or, you can file a Jira for a new Lucene Query for phrase and/or span 
queries that measures distance by offsets rather than positions.


-- Jack Krupansky

-Original Message- 
From: wgggfiy

Sent: Monday, May 13, 2013 3:47 AM
To: java-user@lucene.apache.org
Subject: Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

Jack, according to you, how can I implement this requirement? Could you give me
a clue? Thank you very much. The regex query seemed not to work. I built the
field like this:

FieldType fieldType = new FieldType();
FieldInfo.IndexOptions indexOptions =
    FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
fieldType.setIndexOptions(indexOptions);
fieldType.setIndexed(true);
fieldType.setTokenized(true);
fieldType.setStored(true);
fieldType.freeze();
return new Field(name, value, fieldType);



-
--
Email: wuqiu.m...@qq.com
--
--
View this message in context: 
http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243p4062852.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene and mongodb

2013-05-14 Thread Jack Krupansky
That was tried with Lucandra/Solandra, which stored the Lucene index in 
Cassandra, but was less than optimal, so that model was discarded in favor 
of indexing Cassandra data directly into Solr/Lucene, side-by-side in each 
Cassandra node, but in native Lucene. The latter approach is now available 
from DataStax as DataStax Enterprise (DSE), which also integrates Hadoop. 
DSE provides the best of Cassandra integrated tightly with the best of Solr.


In DSE, Cassandra takes care of all the cluster management, with Solr 
indexing the local Cassandra data, handing off incoming updates to Cassandra 
for normal Cassandra storage, and using a variation of the normal Solr code 
for distributing queries and merging results from other nodes, but depending 
on Cassandra for information about cluster configuration.


(Note: I had proposed to talk in more detail about the above at Lucene 
Revolution, but my proposal was not accepted.)


See:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

As it says, "DataStax Enterprise is completely free for development work."

-- Jack Krupansky

-Original Message- 
From: Rider Carrion Cleger

Sent: Tuesday, May 14, 2013 4:35 AM
To: java-user-i...@lucene.apache.org ; java-user-...@lucene.apache.org ; 
java-user@lucene.apache.org

Subject: lucene and mongodb

Hi team,
I'm working with apache lucene 4.2.1 and I would like to store lucene index
in a NoSql database.
So my questions are,
- Can I store the lucene index in a mongodb database ?

thank you, team! 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: classic.QueryParser - bug or new behavior?

2013-05-19 Thread Jack Krupansky
Yeah, just go ahead and escape the slash, either with a backslash or by 
enclosing the whole term in quotes. Otherwise the slash (even embedded in 
the middle of a term!) indicates the start of a regex query term.
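
That is (a sketch, given a classic QueryParser instance called "parser"):

  Query q1 = parser.parse("20110920\\/EXPIRED");    // backslash-escape the slash
  Query q2 = parser.parse("\"20110920/EXPIRED\"");  // or quote the whole term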


-- Jack Krupansky

-Original Message- 
From: Scott Smith

Sent: Sunday, May 19, 2013 2:50 PM
To: java-user@lucene.apache.org
Subject: classic.QueryParser - bug or new behavior?

I just upgraded from lucene 4.1 to 4.2.1.  We believe we are seeing some 
different behavior.


I'm using org.apache.lucene.queryparser.classic.QueryParser.  If I pass the 
string "20110920/EXPIRED" (w/o quotes) to the parser, I get:


org.apache.lucene.queryparser.classic.ParseException: Cannot parse 
'20110920/EXPIRED': Lexical error at line 1, column 17.  Encountered:  
after : "/EXPIRED"
  at 
org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:131)


We believe this used to work.

I tried googling for this and found something that said I should use 
QueryParser.escape() on the string before passing it to the parser. 
However, that seems to break phrase queries (e.g., "John Smith" - with the 
quotes; I assume it's escaping the double-quotes and doesn't realize it's a 
phrase).


Since it is a forward slash, I'm confused why it would need escaping of any 
of the characters in the string with the "/EXPIRED".


Has anyone seen this?

Scott 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Case insensitive StringField?

2013-05-21 Thread Jack Krupansky
To be clear, analysis is not supported on StringField (or any non-tokenized 
field). But the good news is that by using the keyword tokenizer 
(KeywordTokenizer) on a TextField, you can get the same effect.


That will preserve the entire input as a single token. You may want to 
include filters to trim exterior white space and normalize interior white 
space.


-- Jack Krupansky

-Original Message- 
From: Shahak Nagiel

Sent: Tuesday, May 21, 2013 10:06 AM
To: java-user@lucene.apache.org
Subject: Case insensitive StringField?

It appears that StringField instances are treated as literals, even though 
my analyzer lower-cases (on both write and read sides).  So, for example, I 
can match with a term query (e.g. "NEW YORK"), but only if the case matches. 
If I use a QueryParser (or MultiFieldQueryParser), it never works because 
these query values are lowercased and don't match.


I've found that using a TextField instead works, presumably because it's 
tokenized and processed correctly by the write analyzer.  However, I would 
prefer that queries match against the entire/exact phrase ("NEW YORK"), 
rather than among the tokens ("NEW" or "YORK").


What's the solution here?

Thanks in advance. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query with phrases, wildcards and fuzziness

2013-05-21 Thread Jack Krupansky

Just escape embedded spaces with a backslash.

-- Jack Krupansky

-Original Message- 
From: Ross Simpson

Sent: Tuesday, May 21, 2013 8:08 PM
To: java-user@lucene.apache.org
Subject: Query with phrases, wildcards and fuzziness

Hi all,

I'm trying to create a fairly complex query, and having trouble constructing 
it.


My index contains a TextField with place names as strings, e.g.:
Port Melbourne, VIC 3207

I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so 
that my strings are not tokenized at all.


I want to support end-user searches like the following, and have them match 
that string above:

Port Melbourne, VIC 3207 (exact)
Port (prefix)
Port Mel (prefix, including a space)
Melbo (wildcard)
Melburne (fuzzy)

I'm trying to get away with not parsing the query myself, and just 
constructing something like this:

parser.parse( "(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR (STRING~1^3) );

That doesn't seem to work, neither with QueryParser nor with 
ComplexPhraseQueryParser.  Specifically, I'm having trouble getting 
appropriate results when there's a space in the input string, notably with 
the wildcard match part (it ends up returning everything in the index).


Is my approach above possible?  I also have had a look at using specific 
Query implementations and combining them in a BooleanQuery, but I'm not 
quite sure how to replicate the "OR" behavior I want (from reading, 
Occur.SHOULD is not equivalent to "OR").



Any suggestions would be appreciated.

Thanks!
Ross




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Case insensitive StringField?

2013-05-21 Thread Jack Krupansky
Yes it is. It always will. But... you can escape the spaces with a 
backslash:


Query q = qp.parse("new\\ york");

-- Jack Krupansky
-Original Message- 
From: Shahak Nagiel

Sent: Tuesday, May 21, 2013 10:09 PM
To: java-user@lucene.apache.org
Subject: Re: Case insensitive StringField?

Jack / Michael: Thanks, but the query parser still seems to be tokenizing 
the query?


public class StringPhraseAnalyzer extends Analyzer {
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tok = new KeywordTokenizer(reader);
        TokenFilter filter = new LowerCaseFilter(Version.LUCENE_41, tok);
        filter = new TrimFilter(filter, true);
        return new TokenStreamComponents(tok, filter);
    }
}

...

Analyzer analyzer = new StringPhraseAnalyzer();

// using this analyzer, add document to index with city TextField (value 
"NEW YORK")


QueryParser qp = new QueryParser(Version.LUCENE_41, "city", analyzer);

Query q = qp.parse("new york");
System.out.println ("Query: " + q);


results in...
Query: city:new city:york    // I expected "city:new york"

...and no matches.  Is a QueryParser the wrong way to generate the query for 
this type of analyzer?



Thanks again!





From: Jack Krupansky 
To: java-user@lucene.apache.org
Sent: Tuesday, May 21, 2013 10:22 AM
Subject: Re: Case insensitive StringField?


To be clear, analysis is not supported on StringField (or any non-tokenized
field). But the good news is that by using the keyword tokenizer
(KeywordTokenizer) on a TextField, you can get the same effect.

That will preserve the entire input as a single token. You may want to
include filters to trim exterior white space and normalize interior white
space.

-- Jack Krupansky

-Original Message- 
From: Shahak Nagiel

Sent: Tuesday, May 21, 2013 10:06 AM
To: java-user@lucene.apache.org
Subject: Case insensitive StringField?

It appears that StringField instances are treated as literals, even though
my analyzer lower-cases (on both write and read sides).  So, for example, I
can match with a term query (e.g. "NEW YORK"), but only if the case matches.
If I use a QueryParser (or MultiFieldQueryParser), it never works because
these query values are lowercased and don't match.

I've found that using a TextField instead works, presumably because it's
tokenized and processed correctly by the write analyzer.  However, I would
prefer that queries match against the entire/exact phrase ("NEW YORK"),
rather than among the tokens ("NEW" or "YORK").

What's the solution here?

Thanks in advance.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query with phrases, wildcards and fuzziness

2013-05-22 Thread Jack Krupansky
Using BooleanQuery and Should is the way to go. There are some nuances, but 
you may not run into them. Sometimes it is more that the query parser syntax 
is the issue rather than the Lucene BQ itself. For example, with a string of 
AND and OR, they all get parsed into a single BQ, which is clearly not 
traditional "Boolean", but if you code multiple BQs that nest (or fully 
parenthesize your source query), you will get a true "Boolean" query. It's 
up to your particular application whether you need "true" Boolean or not.
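
As a sketch of the programmatic form (the field name "place" and the boosts 
are just carried over from your example; the terms are lowercased by hand 
because these multi-term queries bypass the analyzer):

  BooleanQuery query = new BooleanQuery();

  Query exact = new TermQuery(new Term("place", "port melbourne, vic 3207"));
  exact.setBoost(9f);
  query.add(exact, BooleanClause.Occur.SHOULD);

  Query prefix = new PrefixQuery(new Term("place", "port mel"));
  prefix.setBoost(7f);
  query.add(prefix, BooleanClause.Occur.SHOULD);

  Query wildcard = new WildcardQuery(new Term("place", "*melbo*"));
  wildcard.setBoost(5f);
  query.add(wildcard, BooleanClause.Occur.SHOULD);

  Query fuzzy = new FuzzyQuery(new Term("place", "melburne"), 1);
  fuzzy.setBoost(3f);
  query.add(fuzzy, BooleanClause.Occur.SHOULD);

With all clauses SHOULD and no MUST/MUST_NOT clauses, a document matches if 
any one of them matches, which is the "OR" behavior you described.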


-- Jack Krupansky

-Original Message- 
From: Ross Simpson

Sent: Wednesday, May 22, 2013 7:44 AM
To: java-user@lucene.apache.org
Subject: Re: Query with phrases, wildcards and fuzziness

One further question:

If I wanted to construct my query using Query implementations instead of
a QueryParser (e.g. TermQuery, WildcardQuery, etc.), what's the right
way to duplicate the "OR" functionality I wrote about below?  As I
mentioned, I've read that wrapping query objects in a BooleanQuery and
using Occur.SHOULD is not necessarily the same.

Any suggestions?

Ross


On 22/05/2013 11:46 AM, Ross Simpson wrote:
Jack, thanks very much!  I wasn't considering a space a special character 
for some reason.  That has worked perfectly.


Cheers,
Ross


On May 22, 2013, at 10:24 AM, Jack Krupansky wrote:


Just escape embedded spaces with a backslash.

-- Jack Krupansky

-Original Message- From: Ross Simpson
Sent: Tuesday, May 21, 2013 8:08 PM
To: java-user@lucene.apache.org
Subject: Query with phrases, wildcards and fuzziness

Hi all,

I'm trying to create a fairly complex query, and having trouble 
constructing it.


My index contains a TextField with place names as strings, e.g.:
Port Melbourne, VIC 3207

I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so 
that my strings are not tokenized at all.


I want to support end-user searches like the following, and have them 
match that string above:

Port Melbourne, VIC 3207 (exact)
Port (prefix)
Port Mel (prefix, including a space)
Melbo (wildcard)
Melburne (fuzzy)

I'm trying to get away with not parsing the query myself, and just 
constructing something like this:
parser.parse( "(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR 
(STRING~1^3) );


That doesn't seem to work, neither with QueryParser nor with 
ComplexPhraseQueryParser.  Specifically, I'm having trouble getting 
appropriate results when there's a space in the input string, notable 
with the wildcard match part (it ends up returning everything in the 
index).


Is my approach above possible?  I also have had a look at using specific 
Query implementations and combining them in a BooleanQuery, but I'm not 
quite sure how to replicate the "OR" behavior I want (from reading, 
Occur.SHOULD is not equivalent or "OR").



Any suggestions would be appreciated.

Thanks!
Ross




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting position increments directly from the the index

2013-05-23 Thread Jack Krupansky
It might be nice to inquire as to the largest position for a field in a 
document. Is that information kept anywhere? Not that I know of, although I 
suppose it can be calculated at runtime by running through all the terms of 
the field. Then he could just divide by 1000.


-- Jack Krupansky

-Original Message- 
From: Michael McCandless

Sent: Thursday, May 23, 2013 6:28 AM
To: Lucene Users
Subject: Re: Getting position increments directly from the the index

Do you actually index the sentence boundary as a token?  If so, you
could just get the totalTermFreq of that token?


Mike McCandless

http://blog.mikemccandless.com


On Wed, May 22, 2013 at 10:11 AM, Igor Shalyminov
 wrote:

Hello!

I'm storing sentence bounds in the index as position increments of 1000.
I want to get the total number of sentences in the index, i. e. the number 
of "1000" increment values.
Can I do that some other way rather than just loading each document and 
extracting position increments with a custom Analyzer?


--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


