Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
The standard analyzer should remove those ampersands and pluses, so the core 
alpha terms should be matched. You would need to use the white space 
analyzer or a custom analyzer to preserve such special characters.


Please give a specific indexed text string and a specific query that fails 
against it.


Also, QueryParser.escape will escape asterisks, so they won't perform a
wildcard query. And then the standard analyzer will remove the asterisks as
it does with most punctuation. If you switch to an analyzer that preserves
special characters, you can manually escape the special characters with a
backslash and leave the asterisk unescaped to perform a wildcard
query.
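
For illustration, here is a minimal sketch of that approach (assuming Lucene
4.4 APIs; the field name and input are made up, and the same
WhitespaceAnalyzer is assumed at index time):

Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_44);  // preserves & and +
QueryParser parser = new QueryParser(Version.LUCENE_44, "title", analyzer);
// Escape the ampersands by hand, but leave the '*' unescaped so it stays a wildcard:
Query q = parser.parse("title:\\&\\&search*");
System.out.println(q.getClass().getSimpleName());  // should print PrefixQuery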


-- Jack Krupansky

-Original Message- 
From: saisantoshi

Sent: Sunday, October 20, 2013 6:02 PM
To: java-user@lucene.apache.org
Subject: Re: Handling special characters in Lucene 4.0

StandardAnalyzer both at index and search time. We use the default one and
don't have any custom analyzers.

Thanks,
Sai



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Handling-special-characters-in-Lucene-4-0-tp4096674p4096710.html





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
Maybe you are not using the same analyzer at index and query time. Even 
though you are correctly escaping the special query syntax characters, 
either the query analyzer is removing them or your index analyzer removed 
them. What analyzer are you using at index time? And, what analyzer are you 
using at query time?


-- Jack Krupansky

-Original Message- 
From: saisantoshi

Sent: Sunday, October 20, 2013 12:47 PM
To: java-user@lucene.apache.org
Subject: Handling special characters in Lucene 4.0

I have created strings like the below

&&searchtext
+sampletext

and when I try to search for them using *&&** or *+**, it does not give
any result.

I am using the QueryParser.escape(String s) method to handle the special
characters, but it does not look like it did anything.

Also, when I search something like this:

title:search*

it works and returns the search result

but when I search like the following, it won't work:
title:*&&**

( No Result)

Is the above valid search criteria? If not, can someone suggest here what
would be appropriate search criteria?

It seems like StandardAnalyzer is stripping out all the special characters
before searching, and that's why the search does seem to work when we
leave the special characters out.

Thanks,
Sai.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Handling-special-characters-in-Lucene-4-0-tp4096674.html





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Wildcard question

2013-10-09 Thread Jack Krupansky

You get to decide:

class QueryParser extends QueryParserBase:

/**
 * Set to true to allow leading wildcard characters.
 * <p>
 * When set, * or ? are allowed as
 * the first character of a PrefixQuery and WildcardQuery.
 * Note that this can produce very slow
 * queries on big indexes.
 * <p>
 * Default: false.
 */
@Override
public void setAllowLeadingWildcard(boolean allowLeadingWildcard) {
  this.allowLeadingWildcard = allowLeadingWildcard;
}

And the default is "false" (leading wildcards are not allowed).
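
A minimal sketch of turning it on (assuming the classic QueryParser in
Lucene 4.x; the field name is made up):

QueryParser parser = new QueryParser(Version.LUCENE_44, "body",
    new StandardAnalyzer(Version.LUCENE_44));
parser.setAllowLeadingWildcard(true);
Query q = parser.parse("*ing");  // would throw ParseException without the setter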

-- Jack Krupansky

-Original Message- 
From: Carlos de Luna Saenz

Sent: Wednesday, October 09, 2013 6:32 PM
To: java-user@lucene.apache.org
Subject: Wildcard question

I've used Lucene 2, 3 and now 4... I used to believe that the * wildcard at the
beginning was accepted since 3 (but I never used it), yet the documentation
says "Note: You cannot use a * or ? symbol as the first character of a
search." Is that correct, or is it an outdated note in the
http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
documentation?
Thanks in advance.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about the CompoundWordTokenFilterBase

2013-09-18 Thread Jack Krupansky
Out of curiosity, what is your use case? I mean, the normal use of this 
filter is to permit a "shorthand" reference to a long term, but why would 
you necessarily want to preclude direct reference to the full term?


-- Jack Krupansky

-Original Message- 
From: Alex Parvulescu

Sent: Wednesday, September 18, 2013 10:27 AM
To: java-user@lucene.apache.org
Subject: Question about the CompoundWordTokenFilterBase

Hi,

While trying to play with the CompoundWordTokenFilterBase I noticed that
the behavior is to include the original token together with the new
sub-tokens.

I assume this is expected (haven't found any relevant docs on this), but I
was wondering if it's a hard requirement or can I propose a small change to
skip the original token (controlled by a flag)?

If there's interest I can put this in a JIRA issue and we can continue the
discussion there.

The patch is not too complicated, but I haven't ran any of the tests yet :)

thanks,
alex 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can you escape characters you don't want the analyzer to modify

2013-09-17 Thread Jack Krupansky
It sounds like you either need to have a custom analyzer or a field-aware 
analyzer.


-- Jack Krupansky

-Original Message- 
From: Scott Smith

Sent: Tuesday, September 17, 2013 4:26 PM
To: java-user@lucene.apache.org
Subject: Can you escape characters you don't want the analyzer to modify

Suppose I have a string like "ab@cd%d".  My analyzer will turn this into "ab 
cd d".  Can I pass it "ab\@cd\%d" and force it to treat it as a single word? 
I want to use the Query parser, but I don't want it messing with fields that 
have not been analyzed. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query type always Boolean Query even if * and ? are present.

2013-09-12 Thread Jack Krupansky
Queries have a syntax, with terms and operators. If there are no explicit
operators, there is an implicit operator (AND/OR = mandatory/optional). A
sequence of terms will generally generate a BooleanQuery.


So, any time you use some lexical construct that has meaning to the query
parser but that you don't want interpreted, you need to escape it.
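
A quick sketch of the difference (not from the original message; assumes the
classic QueryParser in 4.x, with made-up terms):

Query q1 = parser.parse("foo bar*");
System.out.println(q1.getClass().getSimpleName());  // BooleanQuery (two clauses)

Query q2 = parser.parse("foo\\ bar*");  // escaped space: a single wildcard term
System.out.println(q2.getClass().getSimpleName());  // PrefixQuery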


-- Jack Krupansky

-Original Message- 
From: Ankit Murarka

Sent: Thursday, September 12, 2013 11:36 AM
To: java-user@lucene.apache.org
Subject: Re: Query type always Boolean Query even if * and ? are present.

Bingo! This has solved my case... Thanks a ton..!!

Does this mean that any input containing spaces, when parsed using
QueryParser, will result in a BooleanQuery UNLESS the white
space is escaped?

So if the input contains white space and is parsed using QueryParser, it
will eventually turn into a BooleanQuery, since for each separate word it will
create a term and each term becomes a Boolean clause?

Can you please point out any other possible pitfalls when using QueryParser
with input containing special characters...

On 9/12/2013 8:59 PM, Jack Krupansky wrote:
You're not escaping white space, so your input will be a sequence of 
terms, which should generate a BooleanQuery. What is the last clause of 
the BQ? It should be your PrefixQuery.


-- Jack Krupansky

-Original Message- From: Ankit Murarka
Sent: Thursday, September 12, 2013 11:25 AM
To: java-user@lucene.apache.org
Subject: Re: Query type always Boolean Query even if * and ? are present.

If I remove the escape call from the function, then it works as
expected.. Prefix/Boolean/Wildcard..

But this is NOT what I want... The escape should be present, else I will
get a lexical error in the Prefix/Boolean/Wildcard cases, since my input will
definitely contain special characters...

Help needed..

TIA..

On 9/12/2013 8:52 PM, Ankit Murarka wrote:

I also tried it with this query:

*

I am still getting it as Boolean Query.. It should be Prefix...

On 9/12/2013 8:50 PM, Jack Krupansky wrote:
The trailing asterisk in your query input is escaped with a backslash, 
so the query parser will not treat it as a wildcard.


-- Jack Krupansky

-Original Message- From: Ankit Murarka
Sent: Thursday, September 12, 2013 10:19 AM
To: java-user@lucene.apache.org
Subject: Query type always Boolean Query even if * and ? are present.

Hello.

I am faced with a trivial issue: every time, my Query is being fired as a
Boolean Query...

Providing Input : value="USER_NAME_MENTIONED"/>\*


This input is provided. Since it contains special characters, I use the
escape method of QueryParser (with escaping removed for * and ?, since they
are needed for wildcard and prefix searches).

public static String escape(String s) {

    System.out.println("CALLED");
    StringBuilder sb = new StringBuilder();

    try
    {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // These characters are part of the query syntax and must be escaped
            if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':'
                    || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~'
                    || c == '|' || c == '&' || c == '/') {
                sb.append('\\');
            }
            sb.append(c);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}

/* The function which is used to provide the HIT: */

Directory directory = null;
IndexReader reader = null;
IndexSearcher searcher = null;
Analyzer analyzer = null;
try
{
    analyzer = new StandardAnalyzer(Version.LUCENE_44);

    // location where the indexes are
    directory = FSDirectory.open(new File(strMainIndexPath.replace("\\", "/")));
    reader = DirectoryReader.open(directory);
    searcher = new IndexSearcher(reader);

    System.out.println("Searching for '" + searchString + "' using QueryParser");
    QueryParser queryParser = new QueryParser(Version.LUCENE_44, "contents", analyzer);

    searchString = CommonMethods.escape(searchString);  // MENTIONED ABOVE

    System.out.println("ESCAPE STRING AFTER THE ESCAPE FUNCTION OF QUERYPARSER >>> " + searchString);

    Query query = queryParser.parse(searchString);

    System.out.println("Type of query: " + query.getClass().getSimpleName());  // GETTING IT AS BOOLEAN ALWAYS..

The last output is always BOOLEAN Query... Even if I am appending * at
the end, the query is always Boolean...

Re: Query type always Boolean Query even if * and ? are present.

2013-09-12 Thread Jack Krupansky
You're not escaping white space, so your input will be a sequence of terms, 
which should generate a BooleanQuery. What is the last clause of the BQ? It 
should be your PrefixQuery.


-- Jack Krupansky

-Original Message- 
From: Ankit Murarka

Sent: Thursday, September 12, 2013 11:25 AM
To: java-user@lucene.apache.org
Subject: Re: Query type always Boolean Query even if * and ? are present.

If I remove the escape call from the function, then it works as
expected.. Prefix/Boolean/Wildcard..

But this is NOT what I want... The escape should be present, else I will
get a lexical error in the Prefix/Boolean/Wildcard cases, since my input will
definitely contain special characters...

Help needed..

TIA..

On 9/12/2013 8:52 PM, Ankit Murarka wrote:

I also tried it with this query:

*

I am still getting it as Boolean Query.. It should be Prefix...

On 9/12/2013 8:50 PM, Jack Krupansky wrote:
The trailing asterisk in your query input is escaped with a backslash, so 
the query parser will not treat it as a wildcard.


-- Jack Krupansky

-Original Message- From: Ankit Murarka
Sent: Thursday, September 12, 2013 10:19 AM
To: java-user@lucene.apache.org
Subject: Query type always Boolean Query even if * and ? are present.

Hello.

I am faced with a trivial issue: every time, my Query is being fired as a
Boolean Query...

Providing Input : \*

This input is provided. Since it contains special characters, I use the
escape method of QueryParser (with escaping removed for * and ?, since they
are needed for wildcard and prefix searches).

public static String escape(String s) {

    System.out.println("CALLED");
    StringBuilder sb = new StringBuilder();

    try
    {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // These characters are part of the query syntax and must be escaped
            if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':'
                    || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~'
                    || c == '|' || c == '&' || c == '/') {
                sb.append('\\');
            }
            sb.append(c);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}

/* The function which is used to provide the HIT: */

Directory directory = null;
IndexReader reader = null;
IndexSearcher searcher = null;
Analyzer analyzer = null;
try
{
    analyzer = new StandardAnalyzer(Version.LUCENE_44);

    // location where the indexes are
    directory = FSDirectory.open(new File(strMainIndexPath.replace("\\", "/")));
    reader = DirectoryReader.open(directory);
    searcher = new IndexSearcher(reader);

    System.out.println("Searching for '" + searchString + "' using QueryParser");
    QueryParser queryParser = new QueryParser(Version.LUCENE_44, "contents", analyzer);

    searchString = CommonMethods.escape(searchString);  // MENTIONED ABOVE

    System.out.println("ESCAPE STRING AFTER THE ESCAPE FUNCTION OF QUERYPARSER >>> " + searchString);

    Query query = queryParser.parse(searchString);

    System.out.println("Type of query: " + query.getClass().getSimpleName());  // GETTING IT AS BOOLEAN ALWAYS..

The last output is always BOOLEAN Query... Even if I am appending * at
the end, the query is always Boolean...
I have to use QueryParser ONLY.
Absolutely no manipulation is done on the string between it being given as
input and it being provided to this search function.

Kindly guide..

TIA.







--
Regards

Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with 
what lies within us"






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query type always Boolean Query even if * and ? are present.

2013-09-12 Thread Jack Krupansky
The trailing asterisk in your query input is escaped with a backslash, so 
the query parser will not treat it as a wildcard.


-- Jack Krupansky

-Original Message- 
From: Ankit Murarka

Sent: Thursday, September 12, 2013 10:19 AM
To: java-user@lucene.apache.org
Subject: Query type always Boolean Query even if * and ? are present.

Hello.

I am faced with a trivial issue: every time, my Query is being fired as a
Boolean Query...

Providing Input : \*

This input is provided. Since it contains special characters, I use the
escape method of QueryParser (with escaping removed for * and ?, since they
are needed for wildcard and prefix searches).

public static String escape(String s) {

    System.out.println("CALLED");
    StringBuilder sb = new StringBuilder();

    try
    {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // These characters are part of the query syntax and must be escaped
            if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':'
                    || c == '^' || c == '[' || c == ']' || c == '\"' || c == '{' || c == '}' || c == '~'
                    || c == '|' || c == '&' || c == '/') {
                sb.append('\\');
            }
            sb.append(c);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}

/* The function which is used to provide the HIT: */

Directory directory = null;
IndexReader reader = null;
IndexSearcher searcher = null;
Analyzer analyzer = null;
try
{
    analyzer = new StandardAnalyzer(Version.LUCENE_44);

    // location where the indexes are
    directory = FSDirectory.open(new File(strMainIndexPath.replace("\\", "/")));
    reader = DirectoryReader.open(directory);
    searcher = new IndexSearcher(reader);

    System.out.println("Searching for '" + searchString + "' using QueryParser");
    QueryParser queryParser = new QueryParser(Version.LUCENE_44, "contents", analyzer);

    searchString = CommonMethods.escape(searchString);  // MENTIONED ABOVE

    System.out.println("ESCAPE STRING AFTER THE ESCAPE FUNCTION OF QUERYPARSER >>> " + searchString);

    Query query = queryParser.parse(searchString);

    System.out.println("Type of query: " + query.getClass().getSimpleName());  // GETTING IT AS BOOLEAN ALWAYS..

The last output is always BOOLEAN Query... Even if I am appending * at
the end, the query is always Boolean...
I have to use QueryParser ONLY.
Absolutely no manipulation is done on the string between it being given as
input and it being provided to this search function.

Kindly guide..

TIA.

--
Regards

Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with 
what lies within us"



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Profiling Solr Lucene for query

2013-09-08 Thread Jack Krupansky
Please send Solr-related inquiries to the Solr user list - this is the 
Lucene (Java) user list.


-- Jack Krupansky

-Original Message- 
From: Manuel Le Normand

Sent: Sunday, September 08, 2013 7:03 AM
To: java-user@lucene.apache.org
Subject: Profiling Solr Lucene for query

Hello all
Looking at the 10% slowest queries, I get very bad performance (~60 sec
per query).
These queries have lots of conditions on my main field (more than a
hundred), including phrase queries and rows=1000. I do return only ids,
though.
I can quite firmly say that this bad performance is due to a slow storage
issue (which is beyond my control for now). Despite this I want to improve
my performance.

As taught in school, I started profiling these queries; the data from a ~1
minute profile is located here:
http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I wait on readVInt, whose stacktrace
(2 out of 2 thread dumps) is:

catalina-exec-3870 - Thread t@6615
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
at org.apache.lucene.index.TermContext.build(TermContext.java:95)
at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)


So I do actually wait for IO as expected, but I might be page faulting too
many times while looking for the TermBlocks (the .tim file), i.e. locating the
term. As I am reindexing now, would it be useful to lower the termInterval
(default 128)? As the FSTs (the .tip files) are small (a few 10-100 MB) and
there is no memory contention, could I lower this param to 8, for
example?

General configs:
solr 4.3
36 shards, each has few million docs

Thanks in advance,
Manu 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Expunge deleting using excessive transient disk space

2013-09-08 Thread Jack Krupansky
You're in the wrong room! (Lucene/Java-user) Move the discussion over to the 
Solr user list.


-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Sunday, September 08, 2013 7:56 AM
To: java-user
Subject: Re: Expunge deleting using excessive transient disk space

How much free disk space do you have when you
try the merge?

Is this a typo?

2
Note name=|

Best
Erick


On Sun, Sep 8, 2013 at 7:26 AM, Manuel Le Normand <
manuel.lenorm...@gmail.com> wrote:


Hi again,
In order to delete part of my index I run a delete-by-query that intends to
erase 15% of the docs.
I added this params to the solrconfig.xml

   2
   2
   5000.0
   10.0
   15.0


The extra params were added in order to promote merging of old segments, but
with a restriction on the transient disk space that can be used (as I have only
15GB per shard).

This procedure failed on a "no space left on device" exception, although
proper calculations show that these params should not exceed the transient
free disk space I have.
Looking at the infostream I can see that the first merges do succeed, but
older segments are still referenced and thus cannot be deleted until all the
merging is done.

Is there any way of overcoming this?




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Fuzzy Searching on Lucene / Solr

2013-08-14 Thread Jack Krupansky
The limit of 2 is hard-coded precisely because good performance for edit
distances above 2 cannot be guaranteed.
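
A small sketch of where the cap shows up in Lucene 4.x (the field and term
are made up):

// FuzzyQuery.defaultMaxEdits == LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE == 2
Query ok = new FuzzyQuery(new Term("name", "michael"), 2);   // fine
Query bad = new FuzzyQuery(new Term("name", "michael"), 3);  // throws IllegalArgumentException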


-- Jack Krupansky

-Original Message- 
From: Michael Tobias

Sent: Wednesday, August 14, 2013 1:00 AM
To: java-user@lucene.apache.org
Subject: Fuzzy Searching on Lucene / Solr

My first post so please be gentle with me.

I am about to start 'playing' with Solr to see if it will be the correct
tool for a new searchable database development.  One of my requirements is
the ability to do 'fuzzy' searches and I understand that the latest versions
of Lucene / Solr use an improved version of indexing and the Levenshtein
distance formula (or rather a modified version of Levenshtein if wished for,
treating letter transpositions as a single difference rather than 2).

Levenshtein is precisely what I need, but I also understand that the maximum
distance currently implemented is a distance of just TWO.  That is not
really adequate for my purposes.  I need to be able to handle at least a
distance of 3 and probably 4.

Is the current maximum distance of 2 hard-coded in the system?  Can it be
over-ridden?  How?

I understand that performance (both indexing and querying) may be impaired
significantly by doing this but that might be a price worth paying.  If it
IS possible to change the max distance to 3 or 4 does anybody have any idea
what the performance implications might be?

Many thanks for any/all assistance you can provide.

Regards

Michael





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DV limited to 32766 ?

2013-08-09 Thread Jack Krupansky

Check out the discussion on:

https://issues.apache.org/jira/browse/LUCENE-4583
"StraightBytesDocValuesField fails if bytes > 32k"

-- Jack Krupansky

-Original Message- 
From: Nicolas Guyot 
Sent: Friday, August 09, 2013 5:57 PM 
To: java-user@lucene.apache.org 
Subject: DV limited to 32766 ? 


Hi,

When writing a large binary DocValue, BinaryDocValuesWriter throws an
exception saying DocValuesField "fieldName" is too large, must be <= 32766.

Is there a way to avoid that limit?

Thanks,
Nicolas

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query serialization/deserialization

2013-08-04 Thread Jack Krupansky
"it would be nice to be able to emit classic Lucene query parser queries 
where possible"


Yeah, but then we hit the problem of the Query terms having been through 
analysis.


Maybe it would be nice if we had query syntax to indicate that terms had 
already been analyzed.


-- Jack Krupansky

-Original Message- 
From: Michael Sokolov

Sent: Sunday, August 04, 2013 4:55 PM
To: java-user@lucene.apache.org
Cc: Denis Bazhenov
Subject: Re: Query serialization/deserialization

On 07/28/2013 07:32 PM, Denis Bazhenov wrote:
A full JSON query ser/deser would be an especially nice addition to Solr, 
allowing direct access to all Lucene Query features even if they haven't 
been integrated into the higher level query parsers.
There is nothing we could do, so we wrote one, in fact :) I'll try to 
elaborate with the team on the question of contributing it to OSS.


On Jul 29, 2013, at 1:54 AM, Jack Krupansky  
wrote:



Yeah, it's a shame such a ser/deser feature isn't available in Lucene.

My idea is to have a separate module that the Query classes can delegate 
to for serialization and deserialization, handling recursion for nested 
query objects, and then have modules for XML, JSON, and a pseudo-Java 
functional notation (maybe closer to JavaScript) for the actual 
formatting and parser. And then developers can subclass the module to add 
any custom Query classes of their own.

I struggled with this as well, and ended up implementing a parallel
query class hierarchy for query classes I cared about; they support a
toXmlNode that generates an xml tree that can then be parsed by the
Lucene XML query parser.  You can see the code here
(https://github.com/msokolov/lux/blob/master/src/main/java/lux/query/ParseableQuery.java)
but it would be hard to use outside my project.  A more general solution
would definitely be welcome.

One challenge of course is the wealth of different query parsers: it
would be nice to be able to emit classic Lucene query parser queries
where possible, wouldn't it, in addition to structured output like XML
and JSON?

-Mike





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to Index each file and then each Line for Complete Phrase Match. Sample Data shown.

2013-08-03 Thread Jack Krupansky
Why not start with something simple? Like, index each log line as a 
tokenized text field and then do PhraseQuery against that text field? Is 
there something else you need beyond that?
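
A bare-bones sketch of that (assuming Lucene 4.x; the field name, sample
text, and the writer/searcher setup are made up):

// Index time: one document per log line.
Document doc = new Document();
doc.add(new TextField("line", "package execution happening-1", Field.Store.YES));
writer.addDocument(doc);

// Query time: a phrase of already-analyzed terms.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("line", "package"));
pq.add(new Term("line", "execution"));
TopDocs hits = searcher.search(pq, 10);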


-- Jack Krupansky

-Original Message- 
From: Ankit Murarka

Sent: Saturday, August 03, 2013 3:22 AM
To: java-user@lucene.apache.org
Subject: How to Index each file and then each Line for Complete Phrase 
Match. Sample Data shown.


Hello All,

I have the following in a log file. Till now I have been indexing the
complete directory containing files which contain data like this:

Now I need to index each line of the file to implement complete phrase
search. I intend to store phrases in the index and then use the SpellChecker API
to suggest similar phrases.

7/20/2013 7:45 *package execution happening-1
* FATAL *check request has been sent for instance* Ip:Port
*EXCEPTION*
7/20/2013 7:45 *This is not working perfectly
* DEBUG *check request for instance being received is status=200
* Ip:Port *EXCEPTION*
7/20/2013 7:45 *Encountering a constant error.
* DEBUG *response is not proper.Expecting some more information on
this detail.
* Ip:Port *EXCEPTION*
7/20/2013 7:45 *This needs urgent attention
* FATAL *I am still trying to ensure it is running perfectly.
Encountering some issues.
* Ip:Port *EXCEPTION*

7/20/2013 8:01 *Job is running fine.*
INFO
*\

*Exception Occured in ClassFactory* * Function()
java.nullPointerException: Value is null
* *Should not be null*

To implement complete phrase search I reckon I need to index each line and
store the phrase. *Phrases in the above-mentioned table are highlighted in
bold.*


So, if I am able to index and store these phrases, then when the user tries
to search for "package executing",

Lucene would be able to provide "package execution happening-1" as a
valid suggestion..


These columns do not have names, and hence I cannot index based on
column name. Also, as shown in the table above, the first column may contain
a time/date or a phrase in itself (shown in the last row).

Please suggest how this is possible using Lucene and its API. The Javadoc does
not seem to guide me anywhere for this case.


--
Regards

Ankit Murarka

"What lies behind us and what lies before us are tiny matters compared with 
what lies within us"



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to by pass analyzer one some fields in QueryParser ?

2013-07-28 Thread Jack Krupansky

PerFieldAnalyzerWrapper

http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html

"This analyzer is used to facilitate scenarios where different fields 
require different analysis techniques."
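
A minimal sketch of wiring it up (assuming Lucene 4.4; the field names are
made up):

Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("date", new KeywordAnalyzer());  // leave the un-tokenized field alone
Analyzer analyzer = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(Version.LUCENE_44), perField);
QueryParser parser = new QueryParser(Version.LUCENE_44, "text", analyzer);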


-- Jack Krupansky

-Original Message- 
From: Wenbo Zhao

Sent: Sunday, July 28, 2013 10:41 AM
To: lucene-user-mail-list
Subject: Re: how to by pass analyzer one some fields in QueryParser ?

Sorry guys,
I think I made a mistake.
The parse string I used was "date:\"2013/07/2*\" text:..."
The quotes make the QueryParser ignore the trailing '*' and analyze the string.
I changed it to "date:2013\/07\/2* text:...", and it works fine.

Sorry for the disturbance :-)



2013/7/28 Wenbo Zhao 


Hi, all
I'm stuck on one simple question which, as the title says, I think should have a
simple solution.
Say I use StandardAnalyzer and have two fields in all documents:
StringField("date"...) is not tokenized, format is 2013/07/28;
TextField("text" ...) is tokenized.

QueryParser parses "date:2013/07/2* text:something" into
"+date:"2013 07 2" +text:something",
which is apparently not what I want.

I have other StringFields, not only date, so I'm looking for a general
solution, not one specific to this date format.

Any reply will be appreciated. Thanks.

--

Best Regards,
ZHAO, Wenbo

===





--

Best Regards,
ZHAO, Wenbo

=== 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query serialization/deserialization

2013-07-28 Thread Jack Krupansky

Yeah, it's a shame such a ser/deser feature isn't available in Lucene.

My idea is to have a separate module that the Query classes can delegate to 
for serialization and deserialization, handling recursion for nested query 
objects, and then have modules for XML, JSON, and a pseudo-Java functional 
notation (maybe closer to JavaScript) for the actual formatting and parser. 
And then developers can subclass the module to add any custom Query classes 
of their own.


A full JSON query ser/deser would be an especially nice addition to Solr, 
allowing direct access to all Lucene Query features even if they haven't 
been integrated into the higher level query parsers.


And maybe the format should have a flag for whether terms have been analyzed 
or not. Then, deserialization could optionally do analysis as well.


The Solr QueryParsing.toString method shows a purely external approach to 
serialization (I've done something similar myself.) This is what is output 
in the "parsedquery" section of debugQuery output for a Solr query response.


-- Jack Krupansky

-Original Message- 
From: Denis Bazhenov

Sent: Sunday, July 28, 2013 1:59 AM
To: java-user@lucene.apache.org
Subject: Query serialization/deserialization

I'm looking for a tool to serialize and deserialize Lucene queries. We have
tried using Query.toString(), but some queries return a string that can't
be parsed by a QueryParser afterwards. The alternative possibility is to use
the standard Java serialization mechanism. The reason I'm trying to avoid it is
that serialization is used to communicate in a distributed system, and we
can't guarantee the equality of Lucene versions at all nodes at any
particular point in time.
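
For a concrete example of the round-trip problem (a sketch, assuming Lucene
4.x; the field names are made up):

DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f);
dmq.add(new TermQuery(new Term("title", "foo")));
dmq.add(new TermQuery(new Term("body", "foo")));
System.out.println(dmq.toString());
// prints something like "(title:foo | body:foo)~0.1", which the classic
// QueryParser cannot parse back into a DisjunctionMaxQuery.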


Is there some way to perform query serialization in "lucene version 
independent way"?

---
Denis Bazhenov 









-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance measurements

2013-07-25 Thread Jack Krupansky
In addition, although I am a bit beyond my expertise here, I believe you 
should be able to take any query object, including one returned from a query 
parser, wrap it with a ConstantScoreQuery, and then search on the CSQ to 
avoid all the scoring overhead.


For example, "*:*" is super fast even though it matches everything - no 
scoring.
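
Something like this (a sketch, untested; the parser and searcher setup are
assumed):

Query parsed = queryParser.parse(userInput);
Query unscored = new ConstantScoreQuery(parsed);  // every match gets the same score
TopDocs hits = searcher.search(unscored, 10);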


-- Jack Krupansky

-Original Message- 
From: Arjen van der Meijden

Sent: Thursday, July 25, 2013 3:06 PM
To: java-user@lucene.apache.org
Subject: Re: Performance measurements

Hi Sriram,

I don't see any obvious mistakes, although you don't need to create a
FilteredQuery: there are plenty of search-methods on the IndexSearcher
that accept both a query (your TermQuery) and a filter (your TermsFilter).

The way I understand Filters (but I have no advanced in-depth knowledge
of them) is that they are very similar to Queries.
Queries are used for two tasks: matching an item and giving some measure
of how "well" it matched (i.e. the score).
Filters are used only for matching, but I doubt there is very much
difference, from a technical point of view, between the two ways of
matching items.

I'll leave more detailed explanations to others, as I might make too
many mistakes or just assume I know something I actually don't :)

Best regards,

Arjen

On 25-7-2013 19:56 Sriram Sankar wrote:

Thanks everyone.  I'm trying this out:


So searching would become:
- Create a Query with only your termA
- Create a TermsFilter with all your termB's
- execute your preferred search-method with both the query and the filter


I don't get the same results as before - and am still debugging. But
I'm including before-and-after code in case someone is able to see a
problem with what I'm doing.

I'm also looking for docs on how filters work (or will read the code). But
at a high level, is the filter fully created when the Filter object is
created? Or is it incrementally built during traversal (when next() and
advance() are called on the filters)? The reason for this question is related
to early termination.

The two versions of code (query based and filter based) are shown below -
let me know if you see a problem with either.  Ignore any minor syntactic
errors that may have got introduced as I simplified my code for inclusion
here.

Thanks,

Sriram.


QUERY APPROACH:

BooleanQuery orTerms = new BooleanQuery();
for (int i = 0; i < orCount; ++i) {
    TermQuery orArg = new TermQuery(new Term("conn", Integer.toString(connection[i])));
    BooleanClause cl = new BooleanClause(orArg, BooleanClause.Occur.SHOULD);
    orTerms.add(cl);
}
TermQuery tq = new TermQuery(new Term("name", name));
BooleanQuery query = new BooleanQuery();
query.add(new BooleanClause(tq, BooleanClause.Occur.MUST));
query.add(new BooleanClause(orTerms, BooleanClause.Occur.MUST));

FILTER APPROACH:

List<Term> terms = new ArrayList<Term>();
for (int i = 0; i < orCount; ++i) {
    terms.add(new Term("conn", Integer.toString(connection[i])));
}
TermsFilter conns = new TermsFilter(terms);
TermQuery tq = new TermQuery(new Term("name", name));
FilteredQuery query = new FilteredQuery(tq, conns);
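
Per Arjen's point above, the FilteredQuery wrapper isn't strictly needed
either; a sketch of handing the filter straight to the search method:

TopDocs hits = searcher.search(tq, conns, 10);  // query + filter, no FilteredQuery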



On Thu, Jul 25, 2013 at 12:14 AM, Arjen van der Meijden <acmmail...@tweakers.net> wrote:

On 24-7-2013 21:58 Sriram Sankar wrote:

On Wed, Jul 24, 2013 at 10:24 AM, Jack Krupansky wrote:



Scoring has been a major focus of Lucene. Non-scored filters are also
available, but the query parsers are focused (exclusively) on
scored-search.

When you say "filter" do you mean a step performed after retrieval? Or is
it yet another retrieval operation?



He is really referring to the Filters available as an addition to
retrieval. The ones you supply with the search-method:
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int%29

Unfortunately the documentation of Lucene is a bit fragmented, but
basically they limit the scope of your search domain (i.e. reduce the
available set of documents) during the processing of a query. So it
basically becomes (query) AND (filters).

There are several useful implementations available for the filters. But in
your case you can just create a single TermsFilter (it's in the queries
module/package) which is simply an OR-list like the one in your example
(similar to a basic IN in SQL):

http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/TermsFilter.html

So searching would become:
- Create a Query with only your termA
- Create a TermsFilter with all your termB's
- execute your preferred search-method with both the query and the filter

Re: Performance measurements

2013-07-24 Thread Jack Krupansky
I think I've exhausted my expertise in Lucene filters, but I think you can 
wrap a query with a filter and also wrap a filter with a query. So, for 
IndexSearcher.search, you could take a filter and wrap it with 
ConstantScoreQuery. So, if a BooleanQuery got wrapped as a filter, it could 
be wrapped as a CSQ for search so that no scoring would be done.
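
A sketch of both directions (untested, assuming Lucene 4.x; someBooleanQuery
and searcher are made up):

Filter asFilter = new QueryWrapperFilter(someBooleanQuery);  // query -> filter
Query asQuery = new ConstantScoreQuery(asFilter);            // filter -> query, no scoring
TopDocs hits = searcher.search(asQuery, 10);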


-- Jack Krupansky

-Original Message- 
From: Sriram Sankar

Sent: Wednesday, July 24, 2013 3:58 PM
To: java-user@lucene.apache.org
Subject: Re: Performance measurements

On Wed, Jul 24, 2013 at 10:24 AM, Jack Krupansky 
wrote:



Unicorn sounds like it was optimized for graph search. Specialized search
engines can in fact beat out generalized search engines for specific use
cases.



Yes and no (I worked on it). Yes, there are many aspects of Unicorn that
have been optimized for graph search. But the tests I am running have very
little to do with those optimizations. I am still learning about Lucene
and suspect that the scoring framework (which has to be very general)
may be contributing to the performance issues. With Unicorn, we made a
decision to do all scoring after retrieval and not during retrieval.




Scoring has been a major focus of Lucene. Non-scored filters are also
available, but the query parsers are focused (exclusively) on 
scored-search.




When you say "filter" do you mean a step performed after retrieval?  Or is
it yet another retrieval operation?




As Adrien indicates, try using raw Lucene filters and you should get much
better results. Whether even that will compete with a use-case-specific
(graph) search engine remains to be seen.



Thanks (I will study this more).

Sriram.






-- Jack Krupansky

-Original Message- From: Sriram Sankar
Sent: Wednesday, July 24, 2013 1:03 PM
To: java-user@lucene.apache.org
Subject: Re: Performance measurements


No I do not need scoring.  This is a pure retrieval query - which matches
what we used to do with Unicorn in Facebook - something like:

(name:sriram AND (friend:1 OR friend:2 ...))

This automatically gives us second degree.

With Unicorn, we would always get sub-millisecond performance even for
n>500.

Should I assume that Lucene is that much worse - or is it that this use
case has not been optimized?

Sriram.



On Wed, Jul 24, 2013 at 9:59 AM, Adrien Grand  wrote:

 Hi,


On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar  wrote:
> termA AND (termB1 OR termB2 OR ... OR termBn)

Maybe this comment is not appropriate for your use-case, but if you
don't actually need scoring from the disjunction on the right of the
query, a TermsFilter will be faster when n gets large.

--
Adrien







-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance measurements

2013-07-24 Thread Jack Krupansky
Unicorn sounds like it was optimized for graph search. Specialized search 
engines can in fact beat out generalized search engines for specific use 
cases.


Scoring has been a major focus of Lucene. Non-scored filters are also 
available, but the query parsers are focused (exclusively) on scored-search.


As Adrien indicates, try using raw Lucene filters and you should get much 
better results. Whether even that will compete with a use-case-specific 
(graph) search engine remains to be seen.


-- Jack Krupansky

-Original Message- 
From: Sriram Sankar

Sent: Wednesday, July 24, 2013 1:03 PM
To: java-user@lucene.apache.org
Subject: Re: Performance measurements

No I do not need scoring.  This is a pure retrieval query - which matches
what we used to do with Unicorn in Facebook - something like:

(name:sriram AND (friend:1 OR friend:2 ...))

This automatically gives us second degree.

With Unicorn, we would always get sub-millisecond performance even for
n>500.

Should I assume that Lucene is that much worse - or is it that this use
case has not been optimized?

Sriram.



On Wed, Jul 24, 2013 at 9:59 AM, Adrien Grand  wrote:


Hi,

On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar  wrote:
> termA AND (termB1 OR termB2 OR ... OR termBn)

Maybe this comment is not appropriate for your use-case, but if you
don't actually need scoring from the disjunction on the right of the
query, a TermsFilter will be faster when n gets large.

--
Adrien






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance measurements

2013-07-24 Thread Jack Krupansky
Thanks for the detailed numbers. Nothing seems unexpected to me. Increasing 
query complexity or term count is simply going to increase query execution 
time.


I think I'll add a new rule to my informal performance guidance - Query 
complexity of no more than ten to twenty terms is a "slam dunk", but more 
than that is "uncharted territory" that risks queries taking more than half 
a second or even multiple seconds and requires a proof of concept 
implementation to validate reasonable query times.


-- Jack Krupansky

-Original Message- 
From: Sriram Sankar

Sent: Wednesday, July 24, 2013 12:11 PM
To: java-user@lucene.apache.org
Subject: Performance measurements

I did some performance tests on a real index using a query having the
following pattern:

termA AND (termB1 OR termB2 OR ... OR termBn)

The results were not good and I was wondering if I may be doing something
wrong (and what I would need to do to improve performance), or is it just
that the OR is very inefficient.

The format of the data below is illustrated by example:

5|10
time: 0.092728962; scored: 18

Here, n=5, and we measure the time to retrieve 10 results, which is
0.0927 seconds. Had we not terminated early, we would have obtained 18 results.

As you will see in the data below, the performance for n=0 is very good,
but goes down drastically as n is increased.

Sriram.


0|10
time: 0.007941587; scored: 10887

0|1000
time: 0.018967384; scored: 10887

0|5000
time: 0.061943552; scored: 10887

0|1
time: 0.115327001; scored: 10887

1|10
time: 0.053950965; scored: 0

5|20
time: 0.274681853; scored: 18

10|10
time: 0.14251254; scored: 22

10|20
time: 0.282503313; scored: 22

20|10
time: 0.251964067; scored: 32

20|30
time: 0.52860957; scored: 32

50|10
time: 0.888969702; scored: 57

50|30
time: 1.078579956; scored: 57

50|50
time: 1.601169195; scored: 57

100|10
time: 1.396391061; scored: 79

100|40
time: 1.8083494; scored: 79

100|80
time: 2.921094513; scored: 79

200|10
time: 2.848105701; scored: 119

200|50
time: 3.472198462; scored: 119

200|100
time: 4.722673648; scored: 119

400|10
time: 4.463727049; scored: 235

400|100
time: 6.554119665; scored: 235

400|200
time: 9.591892527; scored: 235 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: QueryParser for DisjunctionMaxQuery, et al.

2013-07-23 Thread Jack Krupansky
I came up with some human-readable BNF rules for the Solr, dismax, and 
edismax query parsers. They are in my book. In fact they are linked from the 
"Solr Hot Spots" section in the preface.


The main problems with re-parsing from a Lucene Query structure are
twofold:


1. Terms have been analyzed, and there is no guarantee that term analysis is 
idempotent (repeatable without changing the result.)
2. "Query enrichment" or "query enhancement" may have occurred - additional 
terms and operators added. Reparsing them may result in a second (or even 
nth) level of enrichment/enhancement. IOW, query enrichment/enhancement is 
not guaranteed to to be idempotent.


That said, sure, somebody could gin up a re-parser. The question is how
useful it would be. I suppose if the spec for the re-parser was that it be
literal, with no term analysis or query enrichment/enhancement, it could
work. The fact that nobody has done it is a testament to its marginal
utility.


Did you have a specific application in mind?

A more useful utility would be a "partial parser" that only parses, without
analysis or enrichment, and generates an abstract syntax tree that
applications could then access and manipulate, and then "regenerate" a true
source query that doesn't have analysis or enrichment (except as the
application may explicitly have performed on the tree).


-- Jack Krupansky

-Original Message- 
From: Beale, Jim (US-KOP)

Sent: Tuesday, July 23, 2013 10:07 AM
To: java-user@lucene.apache.org
Subject: QueryParser for DisjunctionMaxQuery, et al.

Hello all,

It seems somewhat odd to me that the Query classes generate strings that the 
QueryParser won't parse. Does anyone have a QueryParser that will parse the 
full range of Lucene query strings?


Failing that, has the BNF been written down somewhere?  I can't seem to find 
it for the full cases.


Thanks for any info/guidance.


Cheers,

Jim Beale
Lead Developer
Hibu.com

The information contained in this email message, including any attachments, 
is intended solely for use by the individual or entity named above and may 
be confidential. If the reader of this message is not the intended 
recipient, you are hereby notified that you must not read, use, disclose, 
distribute or copy any part of this communication. If you have received this 
communication in error, please immediately notify me by email and destroy 
the original message, including any attachments. Thank you.





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Trying to search java.lang.NullPointerException in log file.

2013-07-22 Thread Jack Krupansky
"This is because the StandardAnalyzer must be splitting the words on 
"SPACES" and since there is no space present here. The entire string is 
converted into 1 token."


Those statements are inconsistent! I mean, what code is converting the 
entire string to 1 token and eliminating white space? Is that your own code 
before you hand the string to the standard analyzer??? That makes no sense. 
I mean, the standard analyzer is using the standard tokenizer that doesn't 
do that!


Are you applying the same analyzer at query time as you do at index time? It 
is not uncommon for Lucene users to forget to do that. If you don't, then 
you will have to hand-analyze the query string and simulate exactly what the 
standard analyzer did at index time.


So, please clarify your situation.


-- Jack Krupansky

-Original Message- 
From: Ankit Murarka

Sent: Monday, July 22, 2013 6:24 AM
To: java-user@lucene.apache.org
Subject: Trying to search java.lang.NullPointerException in log file.

Hello. I am trying to search java.lang.NullPointerException in a log
file. The log file is huge.

However I am unable to search it. This is because the StandardAnalyzer
must be splitting the words on "SPACES" and since there is no space
present here. The entire string is converted into 1 token.

What can be a possible way of finding
"Exception:java.lang.NullPointerException" in a log file?

The string may be different also. Suppose "Exception:
java.lang.NullPointerException error occured"

I am trying to use Phrase Query but I am not sure if that will serve the
purpose.

Can please someone suggest.

--
Regards
Ankit





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Trying to search java.lang.NullPointerException in log file.

2013-07-22 Thread Jack Krupansky
The standard analyzer/tokenizer will use white space and other punctuation to
delimit tokens. The rules are a little complicated (although I tried to
summarize them for Solr in my book) - the same rules apply for Lucene.


Verify that you are properly constructing a PhraseQuery from your analyzed 
text at query time. What is the exact query text and what are the exact 
analyzer tokens for that query text and how many are there?
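
One way to see exactly which tokens the analyzer produces (a sketch,
assuming Lucene 4.x; 'analyzer' is the one used at index time):

TokenStream ts = analyzer.tokenStream("contents",
    new StringReader("Exception: java.lang.NullPointerException error occured"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString());  // one line per token
}
ts.end();
ts.close();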


-- Jack Krupansky

-Original Message- 
From: Ankit Murarka

Sent: Monday, July 22, 2013 10:29 AM
To: java-user@lucene.apache.org
Subject: Re: Trying to search java.lang.NullPointerException in log file.

First things first: the same analyzer is being used to index and to search.

Now, I am not using any custom analyzer to split the string and get the
tokens. I was assuming StandardAnalyzer might be using whitespace to
split the content. If that is not the case then I must have got it
completely wrong.

So how should I proceed for searching "java.lang.NullPointer"? This
string might be present after a colon, like ":java.lang.NullPointer". In both
cases I want to search for "java.lang.NullPointer" only.


On 7/22/2013 7:51 PM, Jack Krupansky wrote:
"This is because the StandardAnalyzer must be splitting the words on 
"SPACES" and since there is no space present here. The entire string is 
converted into 1 token."


Those statements are inconsistent! I mean, what code is converting the 
entire string to 1 token and eliminating white space? Is that your own 
code before you hand the string to the standard analyzer??? That makes no 
sense. I mean, the standard analyzer is using the standard tokenizer that 
doesn't do that!


Are you applying the same analyzer at query time as you do at index time? 
It is not uncommon for Lucene users to forget to do that. If you don't, 
then you will have to hand-analyze the query string and simulate exactly 
what the standard analyzer did at index time.


So, please clarify your situation.


-- Jack Krupansky

-Original Message- From: Ankit Murarka
Sent: Monday, July 22, 2013 6:24 AM
To: java-user@lucene.apache.org
Subject: Trying to search java.lang.NullPointerException in log file.

Hello. I am trying to search java.lang.NullPointerException in a log
file. The log file is huge.

However I am unable to search it. This is because the StandardAnalyzer
must be splitting the words on "SPACES" and since there is no space
present here. The entire string is converted into 1 token.

What can be a possible way of finding
"Exception:java.lang.NullPointerException" in a log file?

The string may be different also. Suppose "Exception:
java.lang.NullPointerException error occured"

I am trying to use Phrase Query but I am not sure if that will serve the
purpose.

Can please someone suggest.




--
Regards

Ankit Murarka

"Peace is found not in what surrounds us, but in what we hold within."





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for words begining with "or"

2013-07-18 Thread Jack Krupansky
Just so you know, the presence of a wildcard in a term means that the term 
will not be analyzed. So, state:OR* should fail since "OR" will not be in 
the index - because it would index as "or" (lowercase). Hmmm... why does 
"or" seem familiar...?


Ah yeah... right!... The standard analyzer includes the standard stop 
filter, which defaults to using this set of stopwords:


final List<String> stopWords = Arrays.asList(
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
);

And... "or" is on that list! So, the standard analyzer is removing "or" from 
the index! That's why the query can't find it.


Unless you really want these stop words removed, construct your own analyzer 
that does not do stop word removal.
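
A minimal sketch of that in 4.x (pass an empty stop-word set to
StandardAnalyzer; the rest of the setup is assumed):

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44, CharArraySet.EMPTY_SET);
// "or" now survives analysis, so a query like state:or* can match it in the index.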


-- Jack Krupansky

-Original Message- 
From: ABlaise

Sent: Friday, July 19, 2013 12:07 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for words begining with "or"

When I make my query, everything goes well until I add the last part:
(city:or* OR state:or*).
I tried the first solution that was given to me, but putting \OR and \AND
doesn't seem to be the solution. The query is actually well built; the parser has
no problem with OR or \OR when parsing, since the parsed query looks like this:
+(+(areaType:city areaType:neighborhood areaType:county)
+areaName:portland*) +(city:or* state:or*).
It seems to me to be a valid query. It's just that it can't seem to find the
'or' *in* the index... it's like the terms don't exist. And I know this because
if I remove the last dysfunctional part of the query, it finds (among
others) the right document, with the state written in it... It's like it
can't 'see' the 'or' in the index...

As for the upper/lower case, I am using a standard Analyzer to index and to
search, and I feed it the states in upper case and it doesn't seem to
change them. Still, I tried putting them in lower case but it didn't change
anything...

Thanks in advance for your future answers and for the help you already
provided me with.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-for-words-begining-with-or-tp4079018p4079035.html





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for words begining with "or"

2013-07-18 Thread Jack Krupansky
Break your query down into simpler pieces for testing. What pieces seem to 
have what problems? Be specific about the symptom, and how you "know" that 
something is wrong.


You wrote:
stored,indexed,tokenized,omitNorms>.

But... the standard analyzer would have lowercased that term. Did it, or are 
you using some other analyzer?


-- Jack Krupansky

-Original Message- 
From: ABlaise

Sent: Thursday, July 18, 2013 9:19 PM
To: java-user@lucene.apache.org
Subject: Searching for words begining with "or"

Hi everyone,

I am new to this forum; I have done some research for my question but I
can't seem to find an answer for it.
I am using Lucene for a project, and I know for sure that somewhere in my
Lucene index I have this document with these elements:
Document
stored,indexed,tokenized,omitNorms
stored,indexed,tokenized,omitNorms
stored,indexed,tokenized,omitNorms>.

I am looking for it but this query doesn't work :  "(+areaType:(City OR
Neighborhood OR County) +areaName:portland*) AND *(city:or* OR state:or*)*"
and I have tried tons of alternatives (o*, o*r, ...). Lucene seems to
mistake 'or' for the OR operator. How should I do to be more precise ?
To add precision to my question, this String goes through a QueryParser with
a StandardAnalyzer before being searched for in the index.

Any help would be welcome!
Thanks in advance,

Adrien



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-for-words-begining-with-or-tp4079018.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing into SolrCloud

2013-07-18 Thread Jack Krupansky
Sorry, but you need to resend this message to the Solr user list - this is 
the Lucene user list.


-- Jack Krupansky

-Original Message- 
From: Beale, Jim (US-KOP)

Sent: Thursday, July 18, 2013 12:34 PM
To: java-user@lucene.apache.org
Subject: Indexing into SolrCloud

Hey folks,

I've been migrating an application which indexes about 15M documents from 
straight-up Lucene into SolrCloud.  We've set up 5 Solr instances with a 3 
zookeeper ensemble using HAProxy for load balancing. The documents are 
processed on a quad core machine with 6 threads and indexed into SolrCloud 
through HAProxy using ConcurrentUpdateSolrServer in order to batch the 
updates.  The indexing box is heavily loaded.


I've been accepting the default HttpClient with 50K buffered docs and 2 
threads, i.e.,


int solrMaxBufferedDocs = 50000;
int solrThreadCount = 2;
solrServer = new ConcurrentUpdateSolrServer(solrHttpIPAddress, 
solrMaxBufferedDocs, solrThreadCount);


autoCommit is configured in the solrconfig as follows:


<autoCommit>
  <maxTime>60</maxTime>
  <maxDocs>50</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>


I'm getting the following errors on the client and server sides 
respectively:


Client side:

2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO 
SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught 
when processing request: Software caused connection abort: socket write 
error
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO 
SystemDefaultHttpClient - Retrying request
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO 
SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught 
when processing request: Software caused connection abort: socket write 
error
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO 
SystemDefaultHttpClient - Retrying request


Server side:

7988753 [qtp1956653918-23] ERROR org.apache.solr.core.SolrCore - 
java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] 
early EOF
   at 
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
   at 
com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
   at 
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
   at 
com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
   at 
org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)


When I disabled autoCommit on the server side, I didn't see any errors there 
but I still get the issue client-side after about 2 million documents - 
which is about 45 minutes.


Has anyone seen this issue before?  I couldn't find anything useful in the 
usual places.


I suppose I could set up Wireshark to see what is happening, but I'm hoping 
that someone has a better suggestion.


Thanks in advance for any help!


Best regards,
Jim Beale

hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067

The information contained in this email message, including any attachments, 
is intended solely for use by the individual or entity named above and may 
be confidential. If the reader of this message is not the intended 
recipient, you are hereby notified that you must not read, use, disclose, 
distribute or copy any part of this communication. If you have received this 
communication in error, please immediately notify me by email and destroy 
the original message, including any attachments. Thank you.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query expansion in Lucene (4.x)

2013-07-17 Thread Jack Krupansky
We don't commonly use the term "query expansion" for Lucene and Solr, but I 
would say that there are two categories of "QE":


1. Lightweight QE, by which I mean things like synonym expansion, stemming, 
stopword removal, spellcheck, and anything else that modifies the raw query 
in any way that affects the results. MoreLikeThis/Find Similar could also be 
considered lightweight QE. In this lightweight sense, Lucene and Solr have 
lots of QE features (see the synonym sketch below).


2. Heavyweight QE, such as the "Unsupervised Feedback" feature that 
LucidWorks Search offers, which can automatically iterate on the original 
query, either refining or expanding the results to increase either precision 
or recall (your choice). Lucene and Solr themselves do not offer this 
automated capability, although the LucidWorks Search query feature is 
completely implemented using underlying Lucene and Solr features.
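
As a taste of the lightweight kind, here is a rough sketch of synonym 
expansion at analysis time (Lucene 4.x Java API; the "tv"/"television" pair 
is just an illustration):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

static Analyzer synonymAnalyzer() throws IOException {
  SynonymMap.Builder builder = new SynonymMap.Builder(true); // true = dedup
  // "tv" expands to "television"; true = keep the original token too.
  builder.add(new CharsRef("tv"), new CharsRef("television"), true);
  final SynonymMap map = builder.build();

  return new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
      TokenStream stream = new LowerCaseFilter(Version.LUCENE_43, source);
      stream = new SynonymFilter(stream, map, true); // true = ignore case
      return new TokenStreamComponents(source, stream);
    }
  };
}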


Wikipedia:
http://en.wikipedia.org/wiki/Query_expansion

Lucid:
http://docs.lucidworks.com/display/help/Unsupervised+Feedback+Options
http://docs.lucidworks.com/display/lweug/Understanding+and+Improving+Relevance#UnderstandingandImprovingRelevance-UnsupervisedFeedback

-- Jack Krupansky

-Original Message- 
From: Michael O'Leary

Sent: Wednesday, July 17, 2013 5:58 PM
To: java-user@lucene.apache.org
Subject: Query expansion in Lucene (4.x)

I was reading a paper about Query Expansion (
http://search.fub.it/claudio/pdf/CSUR-2012.pdf) and it said: "For instance,
Google Enterprise, MySQL and Lucene provide the user with an AQE facility
that can be turned on or off."

I searched through the Lucene 4.1.0 source code, which is what I have
downloaded, and through Lucene forums and the web generally looking for
references to query expansion in Lucene, and what I found was some
discussion of the SynonymFilter class and external projects called LucQE
and LuceneQE.

Is there other code in Lucene that supports query expansion that I missed
(because I didn't expand the queries I was searching with enough or
something :-) or are these the primary Lucene and Lucene-compatible
components that support query expansion?
Thanks,
Mike 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is text searching algorithm in Lucene 4.3.1

2013-07-17 Thread Jack Krupansky

The core tf-idf scoring is described in this Javadoc:
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

That describes the scoring model and cites some papers.

Then you can navigate up to the base class and see that BM25 is another 
derived class. Unfortunately, that has less Javadoc, although it does cite a 
key paper on that approach.
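
As a sketch of plugging in one of those alternatives (Lucene 4.x; "analyzer" 
and "searcher" are assumed to exist already):

import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.util.Version;

// The same Similarity should be used at index time and at search time.
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
config.setSimilarity(new BM25Similarity());
// ... and on the search side:
searcher.setSimilarity(new BM25Similarity());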


-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Wednesday, July 17, 2013 8:17 AM
To: java-user
Subject: Re: What is text searching algorithm in Lucene 4.3.1

Note: as of Lucene 4.x, you can plug in your
own scoring algorithm, it ships with several
variants (e.g. BM25) so you can look at the
pluggable scoring where all the code for the
various algorithms is concentrated.

Erick

On Wed, Jul 17, 2013 at 12:40 AM, Jack Krupansky
 wrote:

The source code is what most people use to understand how Lucene actually
works. In some cases the Javadoc comments will point to published papers 
or

web sites for algorithms or approaches.

-- Jack Krupansky

-Original Message- From: Vinh Đặng
Sent: Tuesday, July 16, 2013 10:54 PM
To: java-user@lucene.apache.org
Subject: What is text searching algorithm in Lucene 4.3.1


Hi all,



I am trying to apply Lucene to a specific domain, so I need to customize
the text searching / text comparing algorithm of Lucene.



Is there any guideline, tutorial, or article which explains how Lucene 
searches and answers queries?



Thank you very much.



--

Thank you very much

VINH Dang (Mr)




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is text searching algorithm in Lucene 4.3.1

2013-07-16 Thread Jack Krupansky
The source code is what most people use to understand how Lucene actually 
works. In some cases the Javadoc comments will point to published papers or 
web sites for algorithms or approaches.


-- Jack Krupansky

-Original Message- 
From: Vinh Đặng

Sent: Tuesday, July 16, 2013 10:54 PM
To: java-user@lucene.apache.org
Subject: What is text searching algorithm in Lucene 4.3.1

Hi all,



I am trying to apply Lucene to a specific domain, so I need to customize 
the text searching / text comparing algorithm of Lucene.




Is there any guideline, tutorial, or article which explains how Lucene 
searches and answers queries?




Thank you very much.



--

Thank you very much

VINH Dang (Mr)




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [ANNOUNCE] Web Crawler

2013-07-15 Thread Jack Krupansky
Lucene does not provide any capabilities for crawling websites. You would 
have to contact the Nutch project, the ManifoldCF project, or other web 
crawling projects.


As far as bypassing robots.txt, that is a very unethical thing to do. It is 
rather offensive that you seem to be suggesting that anybody on this mailing 
list would engage in such an unethical or unprofessional activity.


-- Jack Krupansky

-Original Message- 
From: Ramakrishna

Sent: Monday, July 15, 2013 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Hi..

I'm trying Nutch to crawl some websites. Unfortunately they have restricted
crawling of their websites via robots.txt. By using Crawl-Anywhere, can I
crawl any website irrespective of that website's robots.txt? If yes, please
send me the materials/links to study about Crawl-Anywhere, or else please
suggest which crawlers can be used to crawl websites without
bothering about the robots.txt of that particular site. It's urgent; please
reply as soon as possible.

Thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078039.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene in Action

2013-07-10 Thread Jack Krupansky
Oh, yeah, ... that was the original book, that got canceled. Long story. 
O'Reilly should have taken that page down by now - oh well. But now I have 
the Solr-only book, self-published as an e-book on Lulu.com.


Yes, LIA2 is still a valuable resource. Details have changed, but most 
concepts are still valid.


-- Jack Krupansky

-Original Message- 
From: Ivan Brusic

Sent: Wednesday, July 10, 2013 10:41 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene in Action

Jack, don't you also have a book coming out on O'Reilly?

http://shop.oreilly.com/product/0636920028765.do

Lucene in Action might be outdated, but many of the core concepts remain
the same. The analysis chain (analyzers/tokenizers) might have a slightly
different API, but the concepts are still valid.

--
Ivan


On Wed, Jul 10, 2013 at 6:55 AM, Vinh Dang  wrote:


Thank you very much, Jack.

Btw, I am still looking for a detailed tutorial which guides me step by step,
from downloading Lucene to configuring the IDE (I unzipped Lucene and received
a complex folder - I don't know what I should do next).

And, a dummy question: there are three files to download on the Lucene 4.3.1
download page. Which file should I download?
Sent from my BlackBerry® smartphone from Viettel

-Original Message-
From: "Jack Krupansky" 
Sender:  "Jack Krupansky" 
Date: Wed, 10 Jul 2013 08:50:39
To: 
Reply-To: java-user@lucene.apache.org
Subject: Re: Lucene in Action

I would also note that Solr is a great way to start learning Lucene since 
a

lot of the underlying Lucene concepts are visible in Solr and eventually a
lot of Lucene users will end up having to re-invent features that Solr 
adds

to Lucene, anyway.

Maybe when I finish covering Solr in my e-book I'll start diving into
coverage of Lucene as well. The original book idea did include covering
both
Lucene and Solr, but that deal fell through and I decided to stick with
Solr
as my initial focus.

Actually, I've already had some thoughts about adding some chapters on a
deeper dive into underlying Lucene concepts for Solr users, such as the
structure of Query objects and tokenization and token filtering, mostly
since advanced Solr users run into issues there, but those areas are
difficult for Lucene users as well.

My e-book:

http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-2/ebook/product-21099700.html

-- Jack Krupansky

-Original Message-
From: Erick Erickson
Sent: Wednesday, July 10, 2013 8:14 AM
To: java-user ; dqvin...@gmail.com
Subject: Re: Lucene in Action

Right, unfortunately, there's nothing that I know of that's super-recent.
Jack Krupansky is e-publishing a book on Solr, which will be more up
to date but I don't know how thoroughly it dives into the underlying 
Lucene

code.

Otherwise, I think the best thing is to tackle a real problem (perhaps
try working on a JIRA?) and get into it that way.

The 3.1 book has a lot of good background, it's still valuable. But the
examples will be out of date.

Unfortunately, writing a book is an incredible amount of work; writing
code is much more fun ...

Best
Erick

On Tue, Jul 9, 2013 at 11:22 AM, Vinh Dang  wrote:
> Please try, I am new on Lucene also, and willing to study and share :)
> Sent from my BlackBerry(R) smartphone from Viettel
>
> -Original Message-
> From: "Vinh Dang" 
> Date: Tue, 9 Jul 2013 15:19:41
> To: 
> Reply-To: dqvin...@gmail.com
> Subject: Re: Lucene in Action
>
> You have my yesterday question :)
>
> After unzipping Lucene, you just need to import the lucene-core JAR file into
> your project to use it (with Eclipse, just drag and drop).
>
> lucene-core.jar (I do not remember the exact name, but this jar file is easy
> to find) provides the core functions of Lucene
> --Original Message--
> From: Šimun Šunjić
> To: java-user@lucene.apache.org
> ReplyTo: java-user@lucene.apache.org
> Subject: Lucene in Action
> Sent: Jul 9, 2013 16:08
>
> I am learning about Apache Lucene from Manning book: Lucene in Action.
> However, the examples from the book are for Lucene v3.0.3, and today Lucene
> is at version 4.3.1. I can't find any good newer Lucene tutorial for
> learning; can you guys from the community suggest me some :)
>
> Thanks
>
> --
> mag.inf. Šunjić Šimun
>
>
> Sent from my BlackBerry(R) smartphone from Viettel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene in Action

2013-07-10 Thread Jack Krupansky
I would also note that Solr is a great way to start learning Lucene since a 
lot of the underlying Lucene concepts are visible in Solr and eventually a 
lot of Lucene users will end up having to re-invent features that Solr adds 
to Lucene, anyway.


Maybe when I finish covering Solr in my e-book I'll start diving into 
coverage of Lucene as well. The original book idea did include covering both 
Lucene and Solr, but that deal fell through and I decided to stick with Solr 
as my initial focus.


Actually, I've already had some thoughts about adding some chapters on a 
deeper dive into underlying Lucene concepts for Solr users, such as the 
structure of Query objects and tokenization and token filtering, mostly 
since advanced Solr users run into issues there, but those areas are 
difficult for Lucene users as well.


My e-book:
http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-2/ebook/product-21099700.html

-- Jack Krupansky

-Original Message- 
From: Erick Erickson

Sent: Wednesday, July 10, 2013 8:14 AM
To: java-user ; dqvin...@gmail.com
Subject: Re: Lucene in Action

Right, unfortunately, there's nothing that I know of that's super-recent.
Jack Krupansky is e-publishing a book on Solr, which will be more up
to date but I don't know how thoroughly it dives into the underlying Lucene
code.

Otherwise, I think the best thing is to tackle a real problem (perhaps
try working on a JIRA?) and get into it that way.

The 3.1 book has a lot of good background, it's still valuable. But the
examples will be out of date.

Unfortunately, writing a book is an incredible amount of work; writing
code is much more fun ...

Best
Erick

On Tue, Jul 9, 2013 at 11:22 AM, Vinh Dang  wrote:

Please try, I am new on Lucene also, and willing to study and share :)
Sent from my BlackBerry(R) smartphone from Viettel

-Original Message-
From: "Vinh Dang" 
Date: Tue, 9 Jul 2013 15:19:41
To: 
Reply-To: dqvin...@gmail.com
Subject: Re: Lucene in Action

You have my yesterday question :)

After unzipping Lucene, you just need to import the lucene-core JAR file into 
your project to use it (with Eclipse, just drag and drop).


lucene-core.jar (I do not remember the exact name, but this jar file is easy 
to find) provides the core functions of Lucene

--Original Message--
From: Šimun Šunjić
To: java-user@lucene.apache.org
ReplyTo: java-user@lucene.apache.org
Subject: Lucene in Action
Sent: Jul 9, 2013 16:08

I am learning about Apache Lucene from Manning book: Lucene in Action.
However, the examples from the book are for Lucene v3.0.3, and today Lucene is
at version 4.3.1. I can't find any good newer Lucene tutorial for learning;
can you guys from the community suggest me some :)

Thanks

--
mag.inf. Šunjić Šimun


Sent from my BlackBerry(R) smartphone from Viettel


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Please Help solve problem of bad read performance in lucene 4.2.1

2013-07-07 Thread Jack Krupansky
To be clear, Lucene and Solr are "search" engines, NOT "storage" engines. 
Has someone claimed otherwise to you?


What is your query performance in in 4.x vs. 3.x? That's the true, proper 
measure of Lucene and Solr performance.


-- Jack Krupansky

-Original Message- 
From: Chris Zhang

Sent: Sunday, July 07, 2013 12:26 PM
To: java-user@lucene.apache.org
Subject: Re: Please Help solve problem of bad read performance in lucene 
4.2.1


Thanks Adrien,
In my project, almost all hit docs are supposed to be fetched for every
query; that's why I am upset by the poor reading performance. Maybe I
should keep the field values that need to be retrieved in a
high-performance storage engine.
In the above test case, the time taken to read all docs in Lucene 3.0 is
about 78 sec, a reading speed of approximately 10MB/s, but 700+ sec in
Lucene 4.2.1, which indicates a reading speed of less than 1MB/s. So I think
the Lucene committers should pay attention to this.


On Sun, Jul 7, 2013 at 10:23 PM, Adrien Grand  wrote:


Indeed, Lucene 4.1+ may be a bit slower for indices that completely
fit in your file-system cache. On the other hand, you should see
better performance with indices which are larger than the amount of
physical memory of your machine. Your reading benchmark only measures
IndexReader.document(int), which should only be used to display summary
results (that is, only called 10 or 20 times per displayed page). Most
of the time, the bottleneck is rather searching, which can be made more
efficient on small indices by switching to an in-memory postings
format.

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Forcing lucene to use specific field when processing parsed query

2013-07-06 Thread Jack Krupansky
Each Query object has getter methods for each of its fields. If a field is a 
nested Query object, you will then have to recursively process it. Be 
prepared to do lots of "instanceof" conditionals. None of this is hard, just 
lots of dog work. Consult the Javadoc for the getter methods.


And when you write your instanceof conditionals, be sure to include an 
"else" that displays the actual class name so that you can then add yet 
another instanceof.


Now, what you are really looking for is the TermQuery, which has a field 
name - that's what you want to change.


After you have called all of the getters, you create a new Query object of 
the current type (the type you referenced in the "instanceof") and return 
that new Query object. Recursion would return the new Query object.
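
A rough sketch of that recursion (Java Lucene API; the Lucene.Net 3.0.3 
equivalent is nearly identical), handling just the two most common types:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

static Query retarget(Query q, String field) {
  if (q instanceof TermQuery) {
    // Rebuild the TermQuery on the forced field, keeping term text and boost.
    Term t = ((TermQuery) q).getTerm();
    TermQuery tq = new TermQuery(new Term(field, t.text()));
    tq.setBoost(q.getBoost());
    return tq;
  } else if (q instanceof BooleanQuery) {
    // Recurse into each clause, preserving MUST/SHOULD/MUST_NOT.
    BooleanQuery bq = new BooleanQuery();
    bq.setBoost(q.getBoost());
    for (BooleanClause c : ((BooleanQuery) q).clauses()) {
      bq.add(retarget(c.getQuery(), field), c.getOccur());
    }
    return bq;
  } else {
    // The "else" that reports the actual class name, so another instanceof
    // branch (PhraseQuery, SpanQuery, ...) can be added as needed.
    System.err.println("Unhandled query type: " + q.getClass().getName());
    return q;
  }
}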


-- Jack Krupansky

-Original Message- 
From: Puneet Pawaia

Sent: Saturday, July 06, 2013 12:54 PM
To: java-user@lucene.apache.org
Subject: Re: Forcing lucene to use specific field when processing parsed 
query


Hi Uwe
Unfortunately, at the stage where I am required to do this, I do not have the
query text. I only have the parsed query.
I thought of iterating through the query clauses etc. but could not find out
how. What function allows me to do this?
Regards
Puneet Pawaia
On 6 Jul 2013 20:06, "Uwe Schindler"  wrote:


Hi,

You can only do this manually with instanceof checks and walking through
BooleanClauses.

The better way to fix your problem would be to change your query parser,
e.g. by overriding getFieldQuery and other protected methods to enforce a
specific field while parsing the query string.

Uwe



Puneet Pawaia  schrieb:
>Hi all,
>
>I am using Lucene.Net 3.0.3 and need to search in a specific field
>(ignoring any fields specified in the query). I am given a parsed
>Lucene
>Query so I am unable to generate a parsed query with my required field.
>
>Is there any functionality in Lucene that allows me to loop through the
>terms of the query and change the field ? The given query would be a
>complex query where there would be spans, clauses etc.
>
>Or perhaps there is some way of forcing Lucene to ignore the fields
>given
>in the parsed query and use a specified field only.
>
>Any help would be most appreciated.
>
>Regards
>Puneet Pawaia

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: handling nonexistent fields in an index

2013-07-03 Thread Jack Krupansky
There is a Lucene filter that you can use to check efficiently whether a 
field has a value or not.


new ConstantScoreQuery(new FieldValueFilter(String field, boolean negate))
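
A rough sketch of the combined "has no value OR has the default value" query 
(Lucene 4.x; "status" and "default" are made-up field/value names):

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.FieldValueFilter;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery query = new BooleanQuery();
// negate = true: matches documents with NO value in the field (old indices)
query.add(new ConstantScoreQuery(new FieldValueFilter("status", true)),
          Occur.SHOULD);
// ... OR documents whose field holds the default value (new indices)
query.add(new TermQuery(new Term("status", "default")), Occur.SHOULD);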

-- Jack Krupansky

-Original Message- 
From: David Carlton

Sent: Wednesday, July 03, 2013 4:27 PM
To: java-user@lucene.apache.org
Subject: handling nonexistent fields in an index

I have a bunch of Lucene indices lying around, and I want to start adding a
new field to documents in new indices that I'm generating. So, for a given
index, either every document in the index will have that field or no
document will have that field.

The new field has a default value; and I would like to write a query that,
when applied to old indices, matches all documents, while when applied to
new indices, it will only match documents with that specific default value.
(Probably the query will include other restrictions, but the other
restrictions have nothing to do with the new field, so they'll apply to
both indices.)


I can, of course, write two different queries, one for the old indices and
one for the new indices; for layering reasons, I'd prefer not to do that,
but it's a possibility. (I can't, however, go back to the old indices and
add the new field in.)

Any suggestions for how to write a single query that will work in both
places? Basically, what I want is a query that says something like

 (field IS MISSING) OR (field = DEFAULT_VALUE)

If it matters, the new field will only take one of a small number of
values, ten or so.


The one hint I've turned up when googling is this:
http://stackoverflow.com/questions/4365369/solr-search-for-documents-where-a-field-doesnt-exist

It talks in terms of Solr, but hopefully I can figure out how to translate
that into stock Lucene? Thinking out loud about what it suggests, I guess
maybe I can generate a WildcardQuery for my field with * (which I hope
won't be too expensive, given how few values my field has), and then do
something like

(field = DEFAULT_VALUE) OR NOT (field matches *)

And then I have to translate that into Lucene BooleanQuery syntax; I think
I can probably handle that step of things (I've done that sort of thing
before), but if anybody has tips, I'm all ears.


Basically, any suggestions would be welcome, whether about the basic
approach or about the details. And I would in particular very much
appreciate advice as to whether or not WildcardQuery(field, *) will have
good performance if field only takes a small number of values.

--
David Carlton
carl...@sumologic.com 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: highlighting component to searchComponent

2013-07-01 Thread Jack Krupansky
Try asking your question on the “Solr user” email list – this is the Lucene 
user list!

-- Jack Krupansky

From: Adrien RUFFIE 
Sent: Monday, July 01, 2013 4:36 AM
To: java-user@lucene.apache.org 
Subject: highlighting component to searchComponent

Hello all, I had the following configuration in my solrconfig.xml:

[the <highlighting> XML configuration did not survive in the archived message]

But when I start my webapp the following message appears: WARNING: Deprecated 
syntax found. <highlighting/> should move to <searchComponent/>

 

So I have tried to convert my highlighting to a searchComponent with the 
following configuration (most of the XML did not survive in the archived 
message; the surviving regex fragmenter parameters were):

  <int name="hl.fragsize">70</int>
  <float name="hl.regex.slop">0.5</float>
  <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>

But now the following error appears … do you know the problem and what the 
correct class implementation is?

 

GRAVE: org.apache.solr.common.SolrException: Error Instantiating 
SolrFragmentsBuilder, solr.highlight.GapFragmenter is not a 
org.apache.solr.highlight.SolrFragmentsBuilder

 

If you know the right way to do this conversion, I'm interested.

 

Best regards,

 

Kind regards,

Adrien RUFFIE
R&D Engineer
40, rue du Village d’Entreprises
31670 Labège
www.e-deal.com
Direct line: +33 1 73 03 29 50
Switchboard: +33 1 73 03 29 80
Fax: +33 1 73 01 69 77
a.ruf...@e-deal.com

E-DEAL supports the UN Global Compact

 


Re: Relevance ranking calculation based on filtered document count

2013-07-01 Thread Jack Krupansky
The very definition of a "filter" in Lucene is that it doesn't influence 
relevance/scoring in any way, so your question is a contradiction in terms.


If you are finding that the use of a filter is affecting the scores of 
documents, then that is clearly a bug.


-- Jack Krupansky

-Original Message- 
From: Nigel V Thomas

Sent: Monday, July 01, 2013 7:38 AM
To: java-user@lucene.apache.org
Subject: Relevance ranking calculation based on filtered document count

Hi,

I would like to know if it is possible to calculate the relevance ranks of
documents based on filtered document count?
The current filter implementations, as far as I know, seem to be applied
after the query is processed and ranked against the full set of documents.
Since system wide IDF values are used to rank documents, the resulting
ordering is different from a set whose range is restricted only to the
filtered set of documents.

Many thanks,

Nigel 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to Perform a Full Text Search on a Number with Leading Zeros or Decimals?

2013-06-28 Thread Jack Krupansky
The user could use a regular expression query to match the numbers, but 
otherwise, you will have to write some specialized token filter to recognize 
numeric tokens and generate extra tokens at the same position for each token 
variant that you want to search for.
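
For the regular-expression route, a rough sketch (Lucene 4.x; the field name 
is hypothetical). The expression is matched against whole indexed terms, so 
optional leading zeros can be expressed directly:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

// Matches indexed terms "12345", "012345", "0012345", ...
Query query = new RegexpQuery(new Term("contents", "0*12345"));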


-- Jack Krupansky

-Original Message- 
From: Todd Hunt

Sent: Friday, June 28, 2013 2:18 PM
To: java-user@lucene.apache.org
Subject: How to Perform a Full Text Search on a Number with Leading Zeros or 
Decimals?


I have an application that is indexing the text from various reports and 
forms that are generated from our core system.  The reports will contain 
dollar amounts and various indexes that contain all numbers, but have 
leading zeros.


If a document contains that following text that is stored in one Lucene 
document field:


"Account 0012345 owes $321.98"

What analyzer can be used to index this text and allow the user to find this 
document by searching on:


12345

OR

321

???

We are currently using a StandardAnalyzer which works well for most of our 
use cases, but not one like this.


I realize that I could create my own token filter to convert any text that 
can be represented by an Integer or Long, with leading zeros or not, and 
convert the value to a normal looking integer without leading zeros.  But 
I'd prefer to reuse an existing analyzer or technique to achieve the same 
results.


Thank you.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about doing a full text search with numeric values

2013-06-27 Thread Jack Krupansky
Do continue to experiment with Solr as a "testbed" - all of the analysis 
filters used by Solr are... part of Lucene, so once you figure things out in 
Solr (using the Solr Admin UI analysis page), you can mechanically translate 
to raw Lucene API calls.


Look at the standard tokenizer; it should do a better job with punctuation.
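
The raw-Lucene equivalent of that Solr analysis page is just consuming a 
TokenStream and printing the terms - a rough sketch (Lucene 4.x):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Print how an analyzer splits a piece of text into indexed terms.
static void dumpTokens(Analyzer analyzer, String text) throws IOException {
  TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(term.toString());
  }
  ts.end();
  ts.close();
}

Feeding it, say, new StandardAnalyzer(Version.LUCENE_43) and the text 
"1-800-costumes.com $118.30" shows exactly which terms end up in the index.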

-- Jack Krupansky

-Original Message- 
From: Todd Hunt

Sent: Thursday, June 27, 2013 1:14 PM
To: java-user@lucene.apache.org
Subject: Questions about doing a full text search with numeric values

I am working on an application that is using Tika to index text based 
documents and store the text results in Lucene.  These documents can range 
anywhere from 1 page to thousands of pages.


We are currently using Lucene 3.0.3.  I am currently using the 
StandardAnalyzer to index and search for the text that is contained in one 
Lucene document field.


For strictly alpha based, English words, the searches return the results as 
expected.  The problem has to do with searching for numeric values in the 
indexed documents.  So examples of text in the documents that cannot be 
found unless wild cards are used are:


- 1-800-costumes.com
  - 800 does not find the text above
- $118.30
  - 118 does not find the text above
- 3tigers
  - 3 does not find the text above
- 00123456
  - 123456 does not find the text above
- 123,abc,foo,bar,456
  - This is in a CSV file
  - Neither 123 nor 456 finds the text above
    - I realize that it has to do with the text only being separated by 
      commas and so it is treated as one token, but I think the issue is no 
      different than the others


The expectation from our users is that if they can open the document in its 
default application (Word, Adobe, Notepad, etc.) and perform a "find" within 
that application and find the text, then our application based on Lucene 
should be able to find the same text.


It is not reasonable for us to request that our users surround their search 
with wildcards.  Also, it seems like a kludge to programmatically put wild 
cards around any numeric values the user may enter for searching.


Is there some type of numeric parser or filter that would help me out with 
these scenarios?


I've looked at Solr and we already have strong foundation of code utilizing 
Spring, Hibernate, and Lucene.  Trying to integrate Solr into our 
application would take too much refactoring and time that isn't available 
for this release.


Also, since these numeric values are embedded within the documents, I don't 
think storing them as their own field would make sense since I want to 
maintain the context of the numeric values within the document.


Thank you. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Language detection

2013-06-27 Thread Jack Krupansky
Oops... sorry, I just realized this was on the Lucene-user list. My response 
was for Solr-ONLY!


-- Jack Krupansky

-Original Message- 
From: Jack Krupansky

Sent: Thursday, June 27, 2013 1:11 PM
To: java-user@lucene.apache.org
Subject: Re: Language detection

You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update
processor to redirect languages to alternate fields, and then set the
non-English fields to be "ignored". But, the document would still be
indexed, just without the redirected text fields.

(Examples of using that update processor are in my book - but not the
"ignored" step.)

There is also a Tika-specific processor as well:
TikaLanguageIdentifierUpdateProcessorFactory

If you really want to completely suppress the indexing of documents
containing non-English text, you'll have to make an explicit check before
sending the document to Solr. Tika also has language detection, so you
could call Tika from an external process before sending the document to
Solr.

-- Jack Krupansky

-Original Message- 
From: Hang Mang

Sent: Thursday, June 27, 2013 11:45 AM
To: java-user@lucene.apache.org
Subject: Language detection

Hello,

Is there some kind of filter or component that I could use to filter out
non-English text? I have a preprocessing step in which I only want to index
English documents.

Best,
Gucko


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Language detection

2013-06-27 Thread Jack Krupansky
You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update 
processor to redirect languages to alternate fields, and then set the 
non-English fields to be "ignored". But, the document would still be 
indexed, just without the redirected text fields.


(Examples of using that update processor are in my book - but not the 
"ignored" step.)


There is also a Tika-specific processor as well:
TikaLanguageIdentifierUpdateProcessorFactory

If you really want to completely suppress the indexing of documents 
containing non-English text, you'll have to make an explicit check before 
sending the document to Solr. Tika also has language detection, so you 
could call Tika from an external process before sending the document to 
Solr.


-- Jack Krupansky

-Original Message- 
From: Hang Mang

Sent: Thursday, June 27, 2013 11:45 AM
To: java-user@lucene.apache.org
Subject: Language detection

Hello,

Is there some kind of filter or component that I could use to filter out
non-English text? I have a preprocessing step in which I only want to index
English documents.

Best,
Gucko 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Securing stored data using Lucene

2013-06-26 Thread Jack Krupansky
Is your device patient data for a single patient or is it for potentially a 
large number of patients (on a single device)?


Generally, the proper way to "secure" a Lucene/Solr server and its data is 
to keep the server and data behind a firewall and also behind an application 
layer so that no outside process can directly talk to Solr. If you are 
developing a custom server based only on Lucene, then it is up to you to 
assure that your server is secure and Lucene has no role in that security.


Or... are you actually thinking of running Lucene on the device? If so, and 
security is a concern, then you are in UNCHARTED TERRITORY and really on 
your own.


In that case, you probably need to be talking about custom codecs for 
encrypted data for all of your fields. Even then, the in-memory 
representation of the field values would be unencrypted and hence insecure 
if someone took a stolen device and directly examined the memory.


How much "Rich Lucene Search" do you need to do on the device itself, as 
opposed to just looking for "(encrypted) blob storage"? If you want Lucene 
to do your searching on field values, the field values will be exposed in 
memory. If all you want is to retrieve an encrypted blob based on an 
encrypted key, why are you even considering Lucene?


-- Jack Krupansky

-Original Message- 
From: Rafaela Voiculescu

Sent: Wednesday, June 26, 2013 5:06 AM
To: java-user@lucene.apache.org
Subject: Re: Securing stored data using Lucene

Hello,

Thank you all for your help and the suggestions. They are very useful. I
wanted to clarify more aspects of the project, since I overlooked them in
my previous mails.

To explain the use case exactly, the application should work like this:

  - The application works with patient data and that's why we want to keep
  things confidential. We are downloading patient data that can go to the
  mobile device (it should even work on desktop in a similar way really)
  - We have to keep the data in the device due to internet connection
  limitation. The device will get, if lucky, internet connection once or
  twice per week, hence us needing to keep the patient data locally
  - The thing I forgot to mention is that the structure of the patient
  data is kept in json format
  - Currently, there is no plan for using a database because the structure
  of the patient data stored locally might need to change (so we want to
  store the JSON as a document in Lucene).
  - And we need to ensure that someone who, for instance, steals the device
  cannot access the data unless they have the encryption key and mechanism,
  and that no one who is not supposed to access the data can do so.

This is why we're trying to find a way to somehow encrypt the JSON
documents and still use Lucene, or to avoid having the index stored as
plaintext, if that is possible.

Thank you again for all your help and in case this mail has given more
useful details and there are other suggestions or comments, I would be very
happy to read them.

Have a nice day,
Rafaela


On 25 June 2013 20:59, SUJIT PAL  wrote:


Hi Rafaela,

I built something along these lines as a proof of concept. All data in the
index was unstored and only fields which were searchable (tokenized and
indexed) were kept in the index. The full record was encrypted and stored
in a MongoDB database. A custom Solr component did the search against the
index, gathered up unique ids of the results, then pulled out the 
encrypted

data from MongoDB, decrypted it on the fly and rendered the results.

You can find the (Scala) code here:
https://github.com/sujitpal/solr4-extras
(under the src/main/scala/com/mycompany/solr4extras/secure folder).

More information (more or less the same as what I wrote but probably a bit
more readable with inlined code):

http://sujitpal.blogspot.com/2012/12/searching-encrypted-document-collection.html

There are some obvious data sync concerns with this sort of setup, but as
Adrian points out, you can't index encrypted data.

HTH
Sujit

On Jun 25, 2013, at 4:17 AM, Adrien Grand wrote:

> On Tue, Jun 25, 2013 at 1:03 PM, Rafaela Voiculescu
>  wrote:
>> Hello,
>
> Hi,
>
>> I am sorry I was not a bit more explicit. I am trying to find an
acceptable
>> way to encrypt the data to prevent any access of it in any way unless
the
>> person who is trying to access it knows how to decrypt it. As I
mentioned,
>> I looked a bit through the patch, but I am not sure of its status.
>
> You can encrypt stored fields, but there is no way to do it correctly
> with fields that have positions indexed: attackers could infer the
> actual terms based on the order of terms (the encrypted version must
> sort the same way as the original terms), frequencies and positions.
>
> --
> Adrien
>
> -
> To 

Re: New Lucene User

2013-06-17 Thread Jack Krupansky
Try starting with Solr. You can have your search server up and running 
without writing any code. And Solr's Data Import Handler can load data 
direct from the database.


-- Jack Krupansky

-Original Message- 
From: raghavendra.k@barclays.com

Sent: Monday, June 17, 2013 5:03 PM
To: java-user@lucene.apache.org
Subject: New Lucene User

Hi,

I have a requirement to perform a full-text search in a new application; I 
came across Lucene and want to check if it helps our cause.


Requirement:

I have a SQL Server database table with around 70 million records in it. It 
is not a live table and the data gets appended to it on a daily basis.


The table has about 30 columns. The user will provide one string, and this 
value has to be searched against 20 columns for each record. All matching 
records need to be displayed in the UI.


My Analysis

Based on what I have read until now about Lucene, I believe I need to 
convert my database table data into a flat file, generate indexes and then 
perform the search.


Questions


-  To begin with, is Lucene a good option for this kind of 
requirement? Note: Let us ignore daily index generation and UI display for 
this discussion.


-  Should the entire data of 70 million records exist in one flat 
file?


-  How do I define what fields (20 columns) should be searched among 
the complete list (30 columns)?


As I am just starting off, I may not even know about other dependencies. I 
kindly request you to provide clarifications / reference to an example that 
would suit my case.


Please let me know if you have any questions.

Thanks,
Raghu


___

This message is for information purposes only, it is not a recommendation, 
advice, offer or solicitation to buy or sell a product or service nor an 
official confirmation of any transaction. It is directed at persons who are 
professionals and is not intended for retail customer use. Intended for 
recipient only. This message is subject to the terms at: 
www.barclays.com/emaildisclaimer.


For important disclosures, please see: 
www.barclays.com/salesandtradingdisclaimer regarding market commentary from 
Barclays Sales and/or Trading, who are active market participants; and in 
respect of Barclays Research, including disclosures relating to specific 
issuers, please see http://publicresearch.barclays.com.


___ 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: compare paragraphs of text - which Query Class to use?

2013-06-14 Thread Jack Krupansky
First, start with Solr and use the edismax query parser with the default 
query operator as "OR" and set pf, pf2, and pf3, and then simply query by 
the raw text of the paragraph. This will order the results by how closely 
the indexed paragraphs match the query paragraph.
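
In Solr that boils down to query parameters along these lines (shown via 
SolrJ; "body" is a stand-in for whatever field holds the paragraph text, and 
paragraphA is assumed to hold the raw text of paragraph A):

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery query = new SolrQuery(paragraphA);
query.set("defType", "edismax"); // edismax query parser
query.set("q.op", "OR");         // default operator OR
query.set("qf", "body");         // field(s) the terms are matched in
query.set("pf", "body");         // boost documents matching the whole phrase
query.set("pf2", "body");        // boost matches on adjacent word pairs
query.set("pf3", "body");        // boost matches on adjacent word triples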


This is also a good technique for detecting plagiarism where a lot of the 
text is similar if not identical.


Once you get experience using this technique in Solr, then simply look at 
the parsed query that edismax generates and do the same in your Lucene Java 
code.


-- Jack Krupansky

-Original Message- 
From: Malgorzata Urbanska

Sent: Friday, June 14, 2013 12:23 PM
To: java-user@lucene.apache.org
Subject: compare paragraphs of text - which Query Class to use?

Hello,

I've just started using Lucene and I'm not sure which Query Classes I
should use in my project.

My goal is to compare paragraphs of text. Paragraph A is a query and
paragraph B is a document for which I would like to calculate a similarity
score.

Paragraphs A and B can in some situations be exactly the same, or not.
Generally, I would like to check whether they talk about the same topic.

In my project I have a set of paragraphs A and a set of paragraphs B, so I'm
looking for some universal solution which allows me to check the similarity
score of each paragraph A against all paragraphs B.

Do you have any suggestions? I really appreciate all of the ideas.

--
gosia 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Indexes explanation

2013-06-10 Thread Jack Krupansky
Sorry, but you really are going to need to work on your "Lucene Basics" 
before you tackle such an ambitious effort.


The Lucene JavaDoc, Solr Wikis, Stack Underflow, blogs of McCandless, et al, 
and Google search in general will cover a lot of the ground, including those 
mysterious terms:


- Indexed terms
- Stored values
- Payloads
- DocValues

-- Jack Krupansky

-Original Message- 
From: nikhil desai

Sent: Monday, June 10, 2013 8:36 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Indexes explanation

I don't think I could get much from what you said; could you please
elaborate? I'd appreciate it.

On Mon, Jun 10, 2013 at 5:20 PM, Jack Krupansky 
wrote:



Your stored value could be very different from your indexed (searchable)
value. You can also associate payloads with an indexed term. And there are
DocValues as well.


-- Jack Krupansky

-Original Message- From: nikhil desai
Sent: Monday, June 10, 2013 8:06 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Indexes explanation


Sure. Thanks Jack.
I don't have much experience working with Lucene, however, here is what I
am trying to resolve.

I learned that the Custom attributes cannot be used for indexing or
searching purposes. However I wanted the attributes to be used for 
indexing

and searching. So I created custom attributes and inserted them as tokens
into the tokenstream by assigning positionIncrement attribute to 0. Now
since my new token stream has attributes(as tokens) and they are used 
while

indexing, I can now search the document based on the attributes(tokens I
newly inserted). However I still have an issue. And by the way I have a 
lot

of attributes that I need to assign to an individual token.

Ex: Sentence: "LinkedIn is famous"
After passing it through the custom analyzer and a few filters that I have
written, and appending attributes to the tokens, the new token stream we get is
"LinkedIn Noun SocialSite famous JJ Positive" - (what that means is that
LinkedIn is a Noun and is also a SocialSite, famous is an adjective and
also a Positive word; 'is' is removed as it does not make sense to index 'is')

This is now definitely searchable based on Attributes(Here: Noun,
SocialSite, JJ, Positive).

However, since I have put this entire text "LinkedIn is famous" as a Field
while adding a Document, when I search for say "SocialSite", I get a
Document as an output which has "LinkedIn is famous" as one of the fields.

However, is it possible to get only "LinkedIn" as output rather than an
entire text? i.e Only the actual token(the token present in the original
input) as output?
Another example: if I search for say "Positive" I should get "famous" as
output and not the entire "LinkedIn is famous".

I know that if I put it as a Field in the document, I should be able to 
get
it, but how do I add such a Field? because, only when the Tokens are 
passed
through the filters we get to know what all Attributes would be attached 
to

it, so while we do indexwriter.addDocument() we have no idea about the
Attributes.

The typical problem that I see is the indexing is done based on the new
tokenstream which is good, but when it retrieves the Document, it has the
older actual Tokenstream(or actual input) and that is what is given as
output.

Does that make any sense? Or do I have an atypical use case that does not go
well with Lucene?

Any help/comments are appreciated.

On Mon, Jun 10, 2013 at 1:32 PM, Jack Krupansky
wrote:

 Even though you've posted for Lucene, you might want to consider taking a

look at Solr because Solr has an Admin UI with an Analysis page which
gives
you a nice display of how index and query text is analyzed into tokens,
terms, and attributes - all of which Solr inherits from Lucene.

And check out the unit tests for Lucene (and Solr) for indexing. Then you
can actually step through code and see it happen.

Otherwise, google for blogs on various sub-topics of interest with
specific terms.

OTOH... don't try diving too deeply until you've written and understood a
fair amount of Java code using Lucene. Otherwise, you won't have enough
context to understand or even ask intelligent questions.

-- Jack Krupansky

-Original Message- From: nikhil desai
Sent: Monday, June 10, 2013 1:24 PM
To: java-user@lucene.apache.org
Subject: Lucene Indexes explanation


Hello,

This is my first post in this group.

I have been using Lucene recently. I have a question.

Where can I find a good explanation of indexes? Or rather, of how indexing
(not really the mathematical aspect) happens in Lucene, which
attributes (charTerm, Offset, etc.) come into play, and the way it is
implemented? I checked "Lucene in Action" and could not find much on
actual indexing and which classes etc. are being used.

Appreciate your help.

Thanks
NIKHIL

---

Re: Lucene Indexes explanation

2013-06-10 Thread Jack Krupansky
Your stored value could be very different from your indexed (searchable) 
value. You can also associate payloads with an indexed term. And there are 
DocValues as well.


-- Jack Krupansky

-Original Message- 
From: nikhil desai

Sent: Monday, June 10, 2013 8:06 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Indexes explanation

Sure. Thanks Jack.
I don't have much experience working with Lucene, however, here is what I
am trying to resolve.

I learned that the Custom attributes cannot be used for indexing or
searching purposes. However I wanted the attributes to be used for indexing
and searching. So I created custom attributes and inserted them as tokens
into the tokenstream by assigning positionIncrement attribute to 0. Now
since my new token stream has attributes(as tokens) and they are used while
indexing, I can now search the document based on the attributes(tokens I
newly inserted). However I still have an issue. And by the way I have a lot
of attributes that I need to assign to an individual token.

Ex: Sentence: "LinkedIn is famous"
After passing it through the custom analyzer and a few filters that I have
written, and appending attributes to the tokens, the new token stream we get is
"LinkedIn Noun SocialSite famous JJ Positive" - (what that means is that
LinkedIn is a Noun and is also a SocialSite, famous is an adjective and also
a Positive word; 'is' is removed as it does not make sense to index 'is')
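
A rough sketch of that injection step as a TokenFilter (Lucene 4.x; the
lookupTag() tagger is made up - a real one would consult a dictionary or a
POS tagger, and the lowercase "linkedin" assumes a LowerCaseFilter ran first):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class TagInjectingFilter extends TokenFilter {
  private final CharTermAttribute termAtt =
      addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private String pendingTag; // tag to emit at the current position

  TagInjectingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingTag != null) {
      // Emit the tag token at the SAME position as the token just returned.
      termAtt.setEmpty().append(pendingTag);
      posIncrAtt.setPositionIncrement(0);
      pendingTag = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pendingTag = lookupTag(termAtt.toString());
    return true;
  }

  // Hypothetical tagger, just enough to mirror the example above.
  private String lookupTag(String term) {
    return "linkedin".equals(term) ? "SocialSite" : null;
  }
}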

This is now definitely searchable based on Attributes(Here: Noun,
SocialSite, JJ, Positive).

However, since I have put this entire text "LinkedIn is famous" as a Field
while adding a Document, when I search for say "SocialSite", I get a
Document as an output which has "LinkedIn is famous" as one of the fields.

However, is it possible to get only "LinkedIn" as output rather than an
entire text? i.e Only the actual token(the token present in the original
input) as output?
Another example: if I search for say "Positive" I should get "famous" as
output and not the entire "LinkedIn is famous".

I know that if I put it as a Field in the document, I should be able to get
it, but how do I add such a Field? because, only when the Tokens are passed
through the filters we get to know what all Attributes would be attached to
it, so while we do indexwriter.addDocument() we have no idea about the
Attributes.

The typical problem that I see is the indexing is done based on the new
tokenstream which is good, but when it retrieves the Document, it has the
older actual Tokenstream(or actual input) and that is what is given as
output.

Does that make any sense? Or do I have an atypical use case that does not go
well with Lucene?

Any help/comments are appreciated.

On Mon, Jun 10, 2013 at 1:32 PM, Jack Krupansky 
wrote:



Even though you've posted for Lucene, you might want to consider taking a
look at Solr because Solr has an Admin UI with an Analysis page which 
gives

you a nice display of how index and query text is analyzed into tokens,
terms, and attributes - all of which Solr inherits from Lucene.

And check out the unit tests for Lucene (and Solr) for indexing. Then you
can actually step through code and see it happen.

Otherwise, google for blogs on various sub-topics of interest with
specific terms.

OTOH... don't try diving too deeply until you've written and understood a
fair amount of Java code using Lucene. Otherwise, you won't have enough
context to understand or even ask intelligent questions.

-- Jack Krupansky

-Original Message- From: nikhil desai
Sent: Monday, June 10, 2013 1:24 PM
To: java-user@lucene.apache.org
Subject: Lucene Indexes explanation


Hello,

This is my first post in this group.

I have been using Lucene recently. I have a question.

Where can I find a good explanation of indexes? Or rather, of how indexing
(not really the mathematical aspect) happens in Lucene, which
attributes (charTerm, Offset, etc.) come into play, and the way it is
implemented? I checked "Lucene in Action" and could not find much on
actual indexing and which classes etc. are being used.

Appreciate your help.

Thanks
NIKHIL

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Indexes explanation

2013-06-10 Thread Jack Krupansky
Even though you've posted for Lucene, you might want to consider taking a 
look at Solr because Solr has an Admin UI with an Analysis page which gives 
you a nice display of how index and query text is analyzed into tokens, 
terms, and attributes - all of which Solr inherits from Lucene.


And check out the unit tests for Lucene (and Solr) for indexing. Then you 
can actually step through code and see it happen.


Otherwise, google for blogs on various sub-topics of interest with specific 
terms.


OTOH... don't try diving too deeply until you've written and understood a 
fair amount of Java code using Lucene. Otherwise, you won't have enough 
context to understand or even ask intelligent questions.


-- Jack Krupansky

-Original Message- 
From: nikhil desai

Sent: Monday, June 10, 2013 1:24 PM
To: java-user@lucene.apache.org
Subject: Lucene Indexes explanation

Hello,

This is my first post in this group.

I have been using Lucene recently. I have a question.

Where can I find a good explanation of indexes? Or rather, of how indexing
(not really the mathematical aspect) happens in Lucene, which
attributes (charTerm, Offset, etc.) come into play, and the way it is
implemented? I checked "Lucene in Action" and could not find much on
actual indexing and which classes etc. are being used.

Appreciate your help.

Thanks
NIKHIL 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting position increments directly from the index

2013-05-23 Thread Jack Krupansky
If you add a special "end of document term" then some of these calculations 
might be easier.


And, give that special term a payload of the sentence count.

While you're at it, insert "end of sentence" terms that could have a
payload of the sentence number.

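A minimal sketch of such a filter, assuming a whitespace-style tokenizer so 
that trailing punctuation survives; the "##eos##" marker term and the 
payload encoding are arbitrary choices here, not anything built into Lucene:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Emits an extra "##eos##" token after any token ending in '.', '!' or '?',
// carrying the running sentence number as its payload.
public final class SentenceMarkerFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private boolean pendingMarker = false;
    private int sentence = 0;

    public SentenceMarkerFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingMarker) {
            // emit the queued sentence marker as its own token
            clearAttributes();
            termAtt.setEmpty().append("##eos##");
            payloadAtt.setPayload(new BytesRef(Integer.toString(sentence++)));
            pendingMarker = false;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        if (termAtt.length() > 0) {
            char last = termAtt.charAt(termAtt.length() - 1);
            pendingMarker = (last == '.' || last == '!' || last == '?');
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingMarker = false;
        sentence = 0;
    }
}

The total sentence count then falls out of the postings:
reader.totalTermFreq(new Term("contents", "##eos##")).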

-- Jack Krupansky
-Original Message- 
From: Michael McCandless

Sent: Thursday, May 23, 2013 10:39 AM
To: Lucene Users
Subject: Re: Getting position increments directly from the the index

On Thu, May 23, 2013 at 9:54 AM, Igor Shalyminov
 wrote:

But, just to clarify, is there a way to get, let's say, a vector of 
position increments directly from the index, without re-parsing document 
contents?


Term vectors (as Jack suggested) are one option, but they are very
heavy (slows down indexing, takes lots of disk space, slow
(seek-per-document) to load at search time).

You can enumerate all positions for each termXdoc in the postings, but
you'd then need to collate by document to get the max position (last
term) for that document.  I guess an int[maxDoc] would do the trick,
then walk that array dividing each maxPosition by 1000.  Or index the
sentence token :)

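A sketch of that collation against the 4.x postings API; "contents" is an
assumed field name, "reader" an open IndexReader, and the field must have
been indexed with positions:

int[] maxPos = new int[reader.maxDoc()];
Terms terms = MultiFields.getTerms(reader, "contents");
if (terms != null) {
    TermsEnum te = terms.iterator(null);
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    DocsAndPositionsEnum dpe = null;
    while (te.next() != null) {
        dpe = te.docsAndPositions(liveDocs, dpe);
        if (dpe == null) {
            break;  // no positions indexed for this field
        }
        int doc;
        while ((doc = dpe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            int freq = dpe.freq();
            for (int i = 0; i < freq; i++) {
                maxPos[doc] = Math.max(maxPos[doc], dpe.nextPosition());
            }
        }
    }
}
// sentence count for doc d is then roughly maxPos[d] / 1000,
// given the +1000 position increments at sentence boundaries
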
Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting position increments directly from the the index

2013-05-23 Thread Jack Krupansky

Take a look at the Term Vectors Component:
http://wiki.apache.org/solr/TermVectorComponent

-- Jack Krupansky

-Original Message- 
From: Igor Shalyminov

Sent: Thursday, May 23, 2013 9:54 AM
To: java-user@lucene.apache.org
Subject: Re: Getting position increments directly from the the index

Thanks, Mike and Jack!

Those are really good options.
But, just to clarify, is there a way to get, let's say, a vector of position 
increments directly from the index, without re-parsing document contents?


--
Best Regards,
Igor

23.05.2013, 16:13, "Jack Krupansky" :

It might be nice to inquire as to the largest position for a field in a
document. Is that information kept anywhere? Not that I know of, although I
suppose it can be calculated at runtime by running through all the terms of
the field. Then he could just divide by 1000.

-- Jack Krupansky

-Original Message-
From: Michael McCandless
Sent: Thursday, May 23, 2013 6:28 AM
To: Lucene Users
Subject: Re: Getting position increments directly from the the index

Do you actually index the sentence boundary as a token?  If so, you
could just get the totalTermFreq of that token?

Mike McCandless

http://blog.mikemccandless.com

On Wed, May 22, 2013 at 10:11 AM, Igor Shalyminov
 wrote:


 Hello!

 I'm storing sentence bounds in the index as position increments of 1000.
 I want to get the total number of sentences in the index, i. e. the 
number

 of "1000" increment values.
 Can I do that some other way rather than just loading each document and
 extracting position increments with a custom Analyzer?

 --
 Best Regards,
 Igor Shalyminov

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting position increments directly from the the index

2013-05-23 Thread Jack Krupansky
It might be nice to inquire as to the largest position for a field in a 
document. Is that information kept anywhere? Not that I know of, although I 
suppose it can be calculated at runtime by running through all the terms of 
the field. Then he could just divide by 1000.


-- Jack Krupansky

-Original Message- 
From: Michael McCandless

Sent: Thursday, May 23, 2013 6:28 AM
To: Lucene Users
Subject: Re: Getting position increments directly from the the index

Do you actually index the sentence boundary as a token?  If so, you
could just get the totalTermFreq of that token?


Mike McCandless

http://blog.mikemccandless.com


On Wed, May 22, 2013 at 10:11 AM, Igor Shalyminov
 wrote:

Hello!

I'm storing sentence bounds in the index as position increments of 1000.
I want to get the total number of sentences in the index, i. e. the number 
of "1000" increment values.
Can I do that some other way rather than just loading each document and 
extracting position increments with a custom Analyzer?


--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query with phrases, wildcards and fuzziness

2013-05-22 Thread Jack Krupansky
Using BooleanQuery and Should is the way to go. There are some nuances, but 
you may not run into them. Sometimes it is more that the query parser syntax 
is the issue rather than the Lucene BQ itself. For example, with a string of 
AND and OR, they all get parsed into a single BQ, which is clearly not 
traditional "Boolean", but if you code multiple BQs that nest (or fully 
parenthesize your source query), you will get a true "Boolean" query. It's 
up to your particular application whether you need "true" Boolean or not.

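For the query in the example below, a sketch of the BooleanQuery
construction ("place" is an assumed field name, and the terms are lowercased
by hand to match the KeywordTokenizer/LowerCaseFilter chain he describes):

BooleanQuery q = new BooleanQuery();
Query exact = new TermQuery(new Term("place", "port melbourne, vic 3207"));
exact.setBoost(9f);
q.add(exact, BooleanClause.Occur.SHOULD);
Query prefix = new PrefixQuery(new Term("place", "port mel"));
prefix.setBoost(7f);
q.add(prefix, BooleanClause.Occur.SHOULD);
Query wildcard = new WildcardQuery(new Term("place", "*melbo*"));
wildcard.setBoost(5f);
q.add(wildcard, BooleanClause.Occur.SHOULD);
Query fuzzy = new FuzzyQuery(new Term("place", "melburne"), 1);
fuzzy.setBoost(3f);
q.add(fuzzy, BooleanClause.Occur.SHOULD);
// any single matching SHOULD clause is enough to match q - that is the
// "OR" behavior; more (or higher-boosted) matching clauses just score higher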

-- Jack Krupansky

-Original Message- 
From: Ross Simpson

Sent: Wednesday, May 22, 2013 7:44 AM
To: java-user@lucene.apache.org
Subject: Re: Query with phrases, wildcards and fuzziness

One further question:

If I wanted to construct my query using Query implementations instead of
a QueryParser (e.g. TermQuery, WildcardQuery, etc.), what's the right
way to duplicate the "OR" functionality I wrote about below?  As I
mentioned, I've read that wrapping query objects in a BooleanQuery and
using Occur.SHOULD is not necessarily the same.

Any suggestions?

Ross


On 22/05/2013 11:46 AM, Ross Simpson wrote:
Jack, thanks very much!  I wasn't considering a space a special character 
for some reason.  That has worked perfectly.


Cheers,
Ross


On May 22, 2013, at 10:24 AM, Jack Krupansky wrote:


Just escape embedded spaces with a backslash.

-- Jack Krupansky

-Original Message- From: Ross Simpson
Sent: Tuesday, May 21, 2013 8:08 PM
To: java-user@lucene.apache.org
Subject: Query with phrases, wildcards and fuzziness

Hi all,

I'm trying to create a fairly complex query, and having trouble 
constructing it.


My index contains a TextField with place names as strings, e.g.:
Port Melbourne, VIC 3207

I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so 
that my strings are not tokenized at all.


I want to support end-user searches like the following, and have them 
match that string above:

Port Melbourne, VIC 3207 (exact)
Port (prefix)
Port Mel (prefix, including a space)
Melbo (wildcard)
Melburne (fuzzy)

I'm trying to get away with not parsing the query myself, and just 
constructing something like this:
parser.parse("(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR (STRING~1^3)");


That doesn't seem to work, neither with QueryParser nor with 
ComplexPhraseQueryParser.  Specifically, I'm having trouble getting 
appropriate results when there's a space in the input string, notable 
with the wildcard match part (it ends up returning everything in the 
index).


Is my approach above possible?  I also have had a look at using specific 
Query implementations and combining them in a BooleanQuery, but I'm not 
quite sure how to replicate the "OR" behavior I want (from reading, 
Occur.SHOULD is not equivalent to "OR").



Any suggestions would be appreciated.

Thanks!
Ross




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Case insensitive StringField?

2013-05-21 Thread Jack Krupansky
Yes it is. It always will. But... you can escape the spaces with a 
backslash:


Query q = qp.parse("new\\ york");

-- Jack Krupansky
-Original Message- 
From: Shahak Nagiel

Sent: Tuesday, May 21, 2013 10:09 PM
To: java-user@lucene.apache.org
Subject: Re: Case insensitive StringField?

Jack / Michael: Thanks, but the query parser still seems to be tokenizing 
the query?


public class StringPhraseAnalyzer extends Analyzer {
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tok = new KeywordTokenizer(reader);
        TokenFilter filter = new LowerCaseFilter(Version.LUCENE_41, tok);
        filter = new TrimFilter(filter, true);
        return new TokenStreamComponents(tok, filter);
    }
}

...

Analyzer analyzer = new StringPhraseAnalyzer();

// using this analyzer, add document to index with city TextField (value 
"NEW YORK")


QueryParser qp = new QueryParser(Version.LUCENE_41, "city", analyzer);

Query q = qp.parse("new york");
System.out.println ("Query: " + q);


results in...
Query: city:new city:york    // I expected "city:new york"

...and no matches.  Is a QueryParser the wrong way to generate the query for 
this type of analyzer?



Thanks again!





From: Jack Krupansky 
To: java-user@lucene.apache.org
Sent: Tuesday, May 21, 2013 10:22 AM
Subject: Re: Case insensitive StringField?


To be clear, analysis is not supported on StringField (or any non-tokenized
field). But the good news is that by using the keyword tokenizer
(KeywordTokenizer) on a TextField, you can get the same effect.

That will preserve the entire input as a single token. You may want to
include filters to trim exterior white space and normalize interior white
space.

-- Jack Krupansky

-Original Message- 
From: Shahak Nagiel

Sent: Tuesday, May 21, 2013 10:06 AM
To: java-user@lucene.apache.org
Subject: Case insensitive StringField?

It appears that StringField instances are treated as literals, even though
my analyzer lower-cases (on both write and read sides).  So, for example, I
can match with a term query (e.g. "NEW YORK"), but only if the case matches.
If I use a QueryParser (or MultiFieldQueryParser), it never works because
these query values are lowercased and don't match.

I've found that using a TextField instead works, presumably because it's
tokenized and processed correctly by the write analyzer.  However, I would
prefer that queries match against the entire/exact phrase ("NEW YORK"),
rather than among the tokens ("NEW" or "YORK").

What's the solution here?

Thanks in advance.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query with phrases, wildcards and fuzziness

2013-05-21 Thread Jack Krupansky

Just escape embedded spaces with a backslash.

-- Jack Krupansky

-Original Message- 
From: Ross Simpson

Sent: Tuesday, May 21, 2013 8:08 PM
To: java-user@lucene.apache.org
Subject: Query with phrases, wildcards and fuzziness

Hi all,

I'm trying to create a fairly complex query, and having trouble constructing 
it.


My index contains a TextField with place names as strings, e.g.:
Port Melbourne, VIC 3207

I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so 
that my strings are not tokenized at all.


I want to support end-user searches like the following, and have them match 
that string above:

Port Melbourne, VIC 3207 (exact)
Port (prefix)
Port Mel (prefix, including a space)
Melbo (wildcard)
Melburne (fuzzy)

I'm trying to get away with not parsing the query myself, and just 
constructing something like this:

parser.parse("(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR (STRING~1^3)");

That doesn't seem to work, neither with QueryParser nor with 
ComplexPhraseQueryParser.  Specifically, I'm having trouble getting 
appropriate results when there's a space in the input string, notable with 
the wildcard match part (it ends up returning everything in the index).


Is my approach above possible?  I also have had a look at using specific 
Query implementations and combining them in a BooleanQuery, but I'm not 
quite sure how to replicate the "OR" behavior I want (from reading, 
Occur.SHOULD is not equivalent to "OR").



Any suggestions would be appreciated.

Thanks!
Ross




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Case insensitive StringField?

2013-05-21 Thread Jack Krupansky
To be clear, analysis is not supported on StringField (or any non-tokenized 
field). But the good news is that by using the keyword tokenizer 
(KeywordTokenizer) on a TextField, you can get the same effect.


That will preserve the entire input as a single token. You may want to 
include filters to trim exterior white space and normalize interior white 
space.

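A sketch of the query side ("city" is the field name used elsewhere in this
thread): since the whole value is indexed as one lowercased token, build a
TermQuery directly instead of running the text through QueryParser, which
splits on whitespace:

Query q = new TermQuery(new Term("city", "new york"));  // one exact, lowercased term
TopDocs hits = searcher.search(q, 10);

(If you must go through QueryParser, escape the embedded space: "new\\ york".)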

-- Jack Krupansky

-Original Message- 
From: Shahak Nagiel

Sent: Tuesday, May 21, 2013 10:06 AM
To: java-user@lucene.apache.org
Subject: Case insensitive StringField?

It appears that StringField instances are treated as literals, even though 
my analyzer lower-cases (on both write and read sides).  So, for example, I 
can match with a term query (e.g. "NEW YORK"), but only if the case matches. 
If I use a QueryParser (or MultiFieldQueryParser), it never works because 
these query values are lowercased and don't match.


I've found that using a TextField instead works, presumably because it's 
tokenized and processed correctly by the write analyzer.  However, I would 
prefer that queries match against the entire/exact phrase ("NEW YORK"), 
rather than among the tokens ("NEW" or "YORK").


What's the solution here?

Thanks in advance. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: classic.QueryParser - bug or new behavior?

2013-05-19 Thread Jack Krupansky
Yeah, just go ahead and escape the slash, either with a backslash or by 
enclosing the whole term in quotes. Otherwise the slash (even embedded in 
the middle of a term!) indicates the start of a regex query term.

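For example, with the 4.x classic query parser ("f" is a placeholder field):

QueryParser qp = new QueryParser(Version.LUCENE_42, "f",
        new StandardAnalyzer(Version.LUCENE_42));
Query q1 = qp.parse("20110920\\/EXPIRED");    // backslash-escape the slash
Query q2 = qp.parse("\"20110920/EXPIRED\"");  // or quote the whole term

Both keep the parser from reading the "/" as the start of a regex term.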

-- Jack Krupansky

-Original Message- 
From: Scott Smith

Sent: Sunday, May 19, 2013 2:50 PM
To: java-user@lucene.apache.org
Subject: classic.QueryParser - bug or new behavior?

I just upgraded from lucene 4.1 to 4.2.1.  We believe we are seeing some 
different behavior.


I'm using org.apache.lucene.queryparser.classic.QueryParser.  If I pass the 
string "20110920/EXPIRED" (w/o quotes) to the parser, I get:


org.apache.lucene.queryparser.classic.ParseException: Cannot parse
'20110920/EXPIRED': Lexical error at line 1, column 17.  Encountered: <EOF>
after : "/EXPIRED"
  at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:131)


We believe this used to work.

I tried googling for this and found something that said I should use 
QueryParser.escape() on the string before passing it to the parser. 
However, that seems to break phrase queries (e.g., "John Smith" - with the 
quotes; I assume it's escaping the double-quotes and doesn't realize it's a 
phrase).


Since it is a forward slash, I'm confused why it would need escaping of any 
of the characters in the string with the "/EXPIRED".


Has anyone seen this?

Scott 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene and mongodb

2013-05-14 Thread Jack Krupansky
That was tried with Lucandra/Solandra, which stored the Lucene index in 
Cassandra, but was less than optimal, so that model was discarded in favor 
of indexing Cassandra data directly into Solr/Lucene, side-by-side in each 
Cassandra node, but in native Lucene. The latter approach is now available 
from DataStax as DataStax Enterprise (DSE), which also integrates Hadoop. 
DSE provides the best of Cassandra integrated tightly with the best of Solr.


In DSE, Cassandra takes care of all the cluster management, with Solr 
indexing the local Cassandra data, handing off incoming updates to Cassandra 
for normal Cassandra storage, and using a variation of the normal Solr code 
for distributing queries and merging results from other nodes, but depending 
on Cassandra for information about cluster configuration.


(Note: I had proposed to talk in more detail about the above at Lucene 
Revolution, but my proposal was not accepted.)


See:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

As it says, "DataStax Enterprise is completely free for development work."

-- Jack Krupansky

-Original Message- 
From: Rider Carrion Cleger

Sent: Tuesday, May 14, 2013 4:35 AM
To: java-user-i...@lucene.apache.org ; java-user-...@lucene.apache.org ; 
java-user@lucene.apache.org

Subject: lucene and mongodb

Hi team,
I'm working with Apache Lucene 4.2.1 and I would like to store the Lucene
index in a NoSQL database.
So my question is:
- Can I store the Lucene index in a MongoDB database?

Thank you, team!



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

2013-05-13 Thread Jack Krupansky

You'll have to be more explicit about the actual data and what didn't work.

Try developing a simple, self-contained unit test with some simple strings 
as input that demonstrates the case that you say doesn't work.


I mean, regular expressions and field analysis can both be quite tricky - 
even a tiny typo can break everything.


To be clear, my suggested approach to using regular expressions works only 
on un-tokenized input, so there won't be any positions or even offsets.


Other than that, you're on your own until you develop that small, 
self-contained unit test.


Or, you can file a Jira for a new Lucene Query for phrase and or span 
queries that measures distance by offsets rather than positions.


-- Jack Krupansky

-Original Message- 
From: wgggfiy

Sent: Monday, May 13, 2013 3:47 AM
To: java-user@lucene.apache.org
Subject: Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

Jack, according to you, how can I implement this requirement? Could you
give me a clue? Thank you very much. The regex query did not seem to work.
I set up the field like this:

FieldType fieldType = new FieldType();
FieldInfo.IndexOptions indexOptions =
    FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
fieldType.setIndexOptions(indexOptions);
fieldType.setIndexed(true);
fieldType.setTokenized(true);
fieldType.setStored(true);
fieldType.freeze();
return new Field(name, value, fieldType);



-
--
Email: wuqiu.m...@qq.com
--
--
View this message in context: 
http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243p4062852.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

2013-05-06 Thread Jack Krupansky
Do you mean the raw character offsets of the starting and ending characters 
of the terms?


No.

Although, if you index the text as a raw string, you might be able to come 
up with a regex query like "jakarta.{1,10}apache"

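A sketch of that, assuming the raw un-tokenized string was indexed in a
field named "raw" (a StringField, say) and "searcher" is an open
IndexSearcher:

Query q = new RegexpQuery(new Term("raw", "jakarta.{1,10}apache"));
TopDocs hits = searcher.search(q, 10);  // 1 to 10 characters between the words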

-- Jack Krupansky

-Original Message- 
From: wgggfiy

Sent: Monday, May 06, 2013 11:39 PM
To: java-user@lucene.apache.org
Subject: [PhraseQuery] Can "jakarta apache"~10 be searched by offset ?

As I know, the syntax *"jakarta apache"~10* is a PhraseQuery with a
slop=10 in positions, but what I want is *based on offset*, not on position.
Can anyone help me? Thanks.



-
--
Email: wuqiu.m...@qq.com
--
--
View this message in context: 
http://lucene.472066.n3.nabble.com/PhraseQuery-Can-jakarta-apache-10-be-searched-by-offset-tp4061243.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multiple PositionIncrement attributes

2013-04-25 Thread Jack Krupansky

You can use SpanNearQuery to seek matches within a specified distance.

Lucene knows nothing about "sentences". But if you have an analyzer or 
custom code that artificially bumps the position to the next multiple of 
some number like 100 or 1000 when a sentence boundary pattern is 
encountered, you could use that number times n to match within n sentences, 
roughly, plus or minus a sentence or two - there is nothing to cause the 
nearness to be rounded or truncated exactly to one of those boundaries.


Maybe you want two numbers: 1) sentence separation, say 1000, and 2) maximum 
sentence length, say 500. The SpanNearQuery would use n-1 times the sentence 
separation plus the maximum sentence length. Well, you have to adjust that 
for how you count sentences - is 1 the current sentence or is that 0?

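A sketch of that arithmetic with assumed numbers (sentence separation 1000,
maximum sentence length 500, field name "text"):

int n = 3;                        // "within 3 sentences"
int slop = (n - 1) * 1000 + 500;  // (n-1) separations plus one max sentence
SpanQuery q = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("text", "obama")),
        new SpanTermQuery(new Term("text", "putin"))
    }, slop, true);               // true = "putin" must come after "obama"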

-- Jack Krupansky

-Original Message- 
From: Igor Shalyminov

Sent: Thursday, April 25, 2013 6:54 AM
To: java-user@lucene.apache.org
Subject: Multiple PositionIncrement attributes

Hi all!

I use PositionIncrement attribute for finding words at some distance from 
each other. And I have two problems with that:
1) I want to search words within one sentence. A possible solution would be 
to set PositionIncrement of +INF (like 30 :) ) to the sentence break tag.
2) I want to use in my search both word-distance and sentence-distance 
between words (e.g. find the word "Putin" within 3 sentences after the word 
"Obama" or find the words "cheese" and "bacon" in one sentence within 3 
words of each other).


For the 2nd problem, is there a way of storing multiple position information 
sources in the index and using them for searching? Say, at least choosing 
one of those for a query.



--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

2013-04-15 Thread Jack Krupansky
Yes, reset was always "mandatory" from an API contract sense, but not always 
enforced in a practical sense in 3.x (no uniformly extreme negative 
consequences), as the original emailer indicated. Now, it is "mandatory" in 
a practical sense as well (extremely annoying consequences in all cases of a 
contract violation). So, I should have said that the contract was mandatory 
but not enforced... which from a practical perspective negates its mandatory 
contractual value.


-- Jack Krupansky

-Original Message- 
From: Uwe Schindler

Sent: Monday, April 15, 2013 11:53 AM
To: java-user@lucene.apache.org
Subject: RE: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

Hi,

It was always mandatory! In Lucene 2.x/3.x some Tokenizers just returned 
bogus, undefined stuff if not correctly reset before usage, especially when 
Tokenizers are "reused" by the Analyzer, which is now mandatory in 4.x. So 
we made it throw some Exception (NPE or AIOOBE) in Lucene 4 by initializing 
the state fields in Lucene 4.0 with some default values that cause the 
Exception. The Exception is not more specified because of performance 
reasons (it's just caused by the new default values set in ctor previously).


-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-----Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, April 15, 2013 4:25 PM
To: java-user@lucene.apache.org
Subject: Re: WhitespaceTokenizer, incrementToke()
ArrayOutOfBoundException

I didn't read your code, but do you have the "reset" that is now mandatory
and throws AIOOBE if not present?

-- Jack Krupansky

-Original Message-
From: andi rexha
Sent: Monday, April 15, 2013 10:21 AM
To: java-user@lucene.apache.org
Subject: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

Hi,
I have tried to get all the tokens from a TokenStream in the same way as I
was doing in the 3.x version of Lucene, but now (at least with
WhitespaceTokenizer) I get an exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
    at java.lang.Character.codePointAtImpl(Character.java:2405)
    at java.lang.Character.codePointAt(Character.java:2369)
    at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
    at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)



The code is quite simple, and I thought that it could have worked, but
obviously it doesn't (unless I have made some mistakes).

Here is the code, in case you spot some bugs on it (although it is 
trivial):

String str = "this is a test";
Reader reader = new StringReader(str);
TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42,
    reader);  // tokenStreamAnalyzer.tokenStream("test", reader);
CharTermAttribute attribute =
    tokenStream.getAttribute(CharTermAttribute.class);
while (tokenStream.incrementToken()) {
    System.out.println(new String(attribute.buffer(), 0,
        attribute.length()));
}

Hope you have some idea of why this is happening.
Regards,
Andi



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

2013-04-15 Thread Jack Krupansky
I didn't read your code, but do you have the "reset" that is now mandatory 
and throws AIOOBE if not present?

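The full 4.x consumer contract, applied to the same loop:

Tokenizer tok = new WhitespaceTokenizer(Version.LUCENE_42,
        new StringReader("this is a test"));
CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
tok.reset();                      // mandatory in 4.x before incrementToken()
while (tok.incrementToken()) {
    System.out.println(term.toString());
}
tok.end();
tok.close();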

-- Jack Krupansky

-Original Message- 
From: andi rexha

Sent: Monday, April 15, 2013 10:21 AM
To: java-user@lucene.apache.org
Subject: WhitespaceTokenizer, incrementToke() ArrayOutOfBoundException

Hi,
I have tried to get all the tokens from a TokenStream in the same way as I 
was doing in the 3.x version of Lucene, but now (at least with 
WhitespaceTokenizer) I get an exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
    at java.lang.Character.codePointAtImpl(Character.java:2405)
    at java.lang.Character.codePointAt(Character.java:2369)
    at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
    at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)




The code is quite simple, and I thought that it could have worked, but 
obviously it doesn't (unless I have made some mistakes).


Here is the code, in case you spot some bugs on it (although it is trivial):
String str = "this is a test";
Reader reader = new StringReader(str);
TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_42,
    reader);  // tokenStreamAnalyzer.tokenStream("test", reader);
CharTermAttribute attribute =
    tokenStream.getAttribute(CharTermAttribute.class);

while (tokenStream.incrementToken()) {
    System.out.println(new String(attribute.buffer(), 0,
        attribute.length()));
}

Hope you have some idea of why this is happening.
Regards,
Andi



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to index Sharepoint files with Lucene

2013-04-10 Thread Jack Krupansky
The Apache ManifoldCF "connector framework" has a SharePoint connector that 
can crawl SharePoint repositories. It has an output connector that feeds 
into Solr/SolrCell, but you can easily develop a connector that outputs 
whatever you want - like put the crawled files into a file system directory, 
or maybe even send each file directly into Tika and then directly index the 
content into Lucene, if that's what you want. In any case, MCF handles the 
SharePoint access and crawling.


See:
http://manifoldcf.apache.org/en_US/index.html

-- Jack Krupansky

-Original Message- 
From: Álvaro Vargas Quezada

Sent: Wednesday, April 10, 2013 5:31 PM
To: java-user@lucene.apache.org
Subject: How to index Sharepoint files with Lucene

Hi everyone!
I'm trying to combine Lucene with SharePoint (we use Windows and SP 2010),
but I couldn't find good tutorials or proven test cases that demonstrate
this integration. Do you know any tutorial, or can you give me some help
with this?
I have read all of "Lucene in Action", but it just talks about indexing
files, not integration with other software.
Thanks in advance. Greetings from Chile



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to consider special character like + ! @ # in Lucene search

2013-04-10 Thread Jack Krupansky
You'll have to switch from using the standard analyzer/tokenizer to using 
the whitespace analyzer/tokenizer, and make sure not to use any additional 
token filters that might eliminate some or all special characters (or 
provide character maps for the ones that do accept character maps.) You will 
need to completely reindex your data as well. And, in some cases you may 
need to escape some special characters with backslash in your queries.

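A sketch against the 4.x API (older releases drop the Version arguments) -
the same analyzer on both sides, then backslash-escape whatever query
syntax characters you want matched literally:

Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_42);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
// ... reindex everything with iwc ...
QueryParser parser = new QueryParser(Version.LUCENE_42, "contents", analyzer);
Query q = parser.parse("a\\/b");  // finds the literal token "a/b"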

-- Jack Krupansky

-Original Message- 
From: neeraj shah

Sent: Wednesday, April 10, 2013 2:07 AM
To: java-user@lucene.apache.org
Subject: how to consider special character like + ! @ # in Lucene search

Hello, I'm using Lucene 2.9.


I have to search a special character like "/" in the given text, but when
I'm searching it gives me 0 hits.
I have tried QueryParser.escape("/") but did not get the result.
How do I proceed further? Please help.

--
With Regards,
Neeraj Kumar Shah 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MLT Using a Query created in a different index

2013-04-05 Thread Jack Krupansky
In a statistical sense, for the majority of documents, yes, but you could
probably find quite a few outlier examples where the results from A to B or
from B to A are significantly or even completely different, or even
non-existent.


-- Jack Krupansky

-Original Message- 
From: Peter Lavin

Sent: Friday, April 05, 2013 3:49 AM
To: java-user@lucene.apache.org
Subject: Re: MLT Using a Query created in a different index


Thanks for that Jack,

so it's fair to say that if both the source and target corpus are large
and diverse, then the impact of using a different index to create the
query would be negligible.

P.

On 04/04/2013 06:49 PM, Jack Krupansky wrote:

The heart of MLT is examining the top result of a query (or maybe more
than one) and identifying the "top" terms from the top document(s) and
then simply using those top terms for a subsequent query. The term
ranking would of course depend on term frequency, and other relevancy
considerations - for the corpus of the original query. A rich query
corpus will give great results, a weak corpus will give weak results -
no matter how rich or weak the final target corpus is. OTOH, if the
target corpus really is representative on the source corpus, then
results should be either good or terrible - the selected/query document
may not have any representation in the target corpus.

-- Jack Krupansky

-Original Message- From: Peter Lavin
Sent: Thursday, April 04, 2013 1:06 PM
To: java-user@lucene.apache.org
Subject: MLT Using a Query created in a different index


Dear Users,

I am doing some research where Lucene is integrated into agent
technology. Part of this work involves using an MLT query in an index
which was not created from a document in that index (i.e. the query is
created, serialised and sent to the remote agent).

Can anyone point me towards any information on what the potential impact
of doing this would be?

I'm assuming if both indexes have similar sets of documents, the impact
would be negligible, but what, for example would be the impact of
creating an MLT query from an index with only one or two documents for
use in an index with several (say 100+) documents,

with thanks,
Peter



--
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MLT Using a Query created in a different index

2013-04-04 Thread Jack Krupansky
The heart of MLT is examining the top result of a query (or maybe more than 
one) and identifying the "top" terms from the top document(s) and then 
simply using those top terms for a subsequent query. The term ranking would 
of course depend on term frequency, and other relevancy considerations - for 
the corpus of the original query. A rich query corpus will give great 
results, a weak corpus will give weak results - no matter how rich or weak 
the final target corpus is. OTOH, if the target corpus really is 
representative of the source corpus, then results should be either good or 
terrible - the selected/query document may not have any representation in 
the target corpus.


-- Jack Krupansky

-Original Message- 
From: Peter Lavin

Sent: Thursday, April 04, 2013 1:06 PM
To: java-user@lucene.apache.org
Subject: MLT Using a Query created in a different index


Dear Users,

I am doing some research where Lucene is integrated into agent
technology. Part of this work involves using an MLT query in an index
which was not created from a document in that index (i.e. the query is
created, serialised and sent to the remote agent).

Can anyone point me towards any information on what the potential impact
of doing this would be?

I'm assuming if both indexes have similar sets of documents, the impact
would be negligible, but what, for example would be the impact of
creating an MLT query from an index with only one or two documents for
use in an index with several (say 100+) documents,

with thanks,
Peter

--
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing a long list

2013-03-31 Thread Jack Krupansky

Multivalued fields are the other approach to keyword value pairs.

And if you can denormalize your data, storing structure as separate 
documents can make sense and support more powerful queries. Although the 
join capabilities are rather limited.

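A sketch of the multivalued shape for the node example further down (the
field and value names are hypothetical):

Document doc = new Document();
doc.add(new StringField("name", "vol1", Field.Store.YES));
// adding the same field name twice makes it multivalued
doc.add(new TextField("outEdges.name", "hasMirror vol2", Field.Store.YES));
doc.add(new TextField("outEdges.name", "hasSnapshot vol3", Field.Store.YES));

// ('name' equals "vol1" AND 'outEdges.name' startsWith "hasMirror")
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("name", "vol1")), BooleanClause.Occur.MUST);
q.add(new PrefixQuery(new Term("outEdges.name", "hasmirror")),
        BooleanClause.Occur.MUST);  // lowercased, assuming a lowercasing analyzer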

-- Jack Krupansky

-Original Message- 
From: Paul Bell

Sent: Sunday, March 31, 2013 8:52 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing a long list

Hi Jack,

Thanks for the reply. I am very new to Lucene.

Your timing is a bit uncanny. I was just coming to the conclusion that
there's nothing special about this case for Lucene, i.e., a tokenized field
should work, when I looked up and saw your e-mail.

In re the larger context: yeah, the properties in question here belong to
some kind of node, e.g., maybe a vertex in a graph DB. Possible properties
include 'name', 'type', 'inEdges', 'outEdges', etc. Most properties are
simple k=v pairs. But a few, notably the 'edge' properties, could be long
lists.

My intent was to create a Lucene Document for each node. The Fields in this
Document would represent all of the node's properties. A generic (not in
Lucene syntax) query should be able to ask after any property, e.g.,

   ('name' equals "vol1" AND 'outEdges.name' startsWith "hasMirror")

Note that 'outEdges.name' represents multiple elements, where 'name'
represents only one. That is, the generic query syntax is trying to match
any out-edge whose name property starts with "hasMirror". I haven't quite
crystallized the generic query syntax and don't know how best to map it to
both a Lucene query and to an appropriate Lucene index structure. Please
let me know if you've any suggestions!

Thanks again.

-Paul



On Sun, Mar 31, 2013 at 8:33 AM, Jack Krupansky 
wrote:



The first question is how do you want to access the data? What do you want
your queries to look like?

What is the larger context? Are these properties of larger documents? Are
there more than one per document? Etc.

Why not just store the property as a tokenized field? Then you can query
whether v(i) or v(j) are or are not present as keywords.

-- Jack Krupansky

-Original Message- From: Paul Bell
Sent: Sunday, March 31, 2013 8:21 AM
To: java-user@lucene.apache.org
Subject: Indexing a long list


Hi All,

Suppose I need to index a property whose value is a long list of terms. For
example,

   someProperty = ["v1", "v2", ..., "v100"]

Please note that I could drop the leading "v" and index these as numbers
instead of strings.

But the question is what's the best practice in Lucene when dealing with a
case like this? I need to be able to retrieve the list. This makes me think
that I need to store it. And I suppose that the list could be stored in the
index itself or in the "content" to which the index points.

So there are really two parts to this question:

1. Lucene "best practices" for long list
2. Where to store such a list

Thanks for your help.

-Paul

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing a long list

2013-03-31 Thread Jack Krupansky
The first question is how do you want to access the data? What do you want 
your queries to look like?


What is the larger context? Are these properties of larger documents? Are 
there more than one per document? Etc.


Why not just store the property as a tokenized field? Then you can query 
whether v(i) or v(j) are or are not present as keywords.


-- Jack Krupansky

-Original Message- 
From: Paul Bell

Sent: Sunday, March 31, 2013 8:21 AM
To: java-user@lucene.apache.org
Subject: Indexing a long list

Hi All,

Suppose I need to index a property whose value is a long list of terms. For
example,

   someProperty = ["v1", "v2", ..., "v100"]

Please note that I could drop the leading "v" and index these as numbers
instead of strings.

But the question is what's the best practice in Lucene when dealing with a
case like this? I need to be able to retrieve the list. This makes me think
that I need to store it. And I suppose that the list could be stored in the
index itself or in the "content" to which the index points.

So there are really two parts to this question:

1. Lucene "best practices" for long list
2. Where to store such a list

Thanks for your help.

-Paul 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive analyzer

2013-03-22 Thread Jack Krupansky

Start with the Standard Tokenizer:
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html

-- Jack Krupansky

-Original Message- 
From: Jerome Blouin

Sent: Friday, March 22, 2013 12:53 PM
To: java-user@lucene.apache.org
Subject: RE: Accent insensitive analyzer

I understand that I can't configure it on an analyzer, so on which class
can I apply it?


Thanks,
Jerome

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, March 22, 2013 12:38 PM
To: java-user@lucene.apache.org
Subject: Re: Accent insensitive analyzer

Try the ASCII Folding Filter:
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

-- Jack Krupansky

-Original Message-
From: Jerome Blouin
Sent: Friday, March 22, 2013 12:22 PM
To: java-user@lucene.apache.org
Subject: Accent insensitive analyzer

Hello,

I'm looking for an analyzer that allows performing accent-insensitive search
in Latin languages. I'm currently using the StandardAnalyzer but it doesn't
fulfill this need. Could you please point me to the one I need to use? I've
checked the javadoc for the various analyzer packages but can't find one. Do
I need to implement my own analyzer?


Regards,
Jerome


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive analyzer

2013-03-22 Thread Jack Krupansky

Try the ASCII Folding Filter:
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

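A sketch of a custom Analyzer that chains it behind the standard tokenizer,
so "café" and "cafe" index and search identically:

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_42, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
        result = new ASCIIFoldingFilter(result);  // fold accented chars to ASCII
        return new TokenStreamComponents(source, result);
    }
};

Use the same analyzer at both index and query time.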
-- Jack Krupansky

-Original Message- 
From: Jerome Blouin

Sent: Friday, March 22, 2013 12:22 PM
To: java-user@lucene.apache.org
Subject: Accent insensitive analyzer

Hello,

I'm looking for an analyzer that allows performing accent-insensitive search
in Latin languages. I'm currently using the StandardAnalyzer but it doesn't
fulfill this need. Could you please point me to the one I need to use? I've
checked the javadoc for the various analyzer packages but can't find one. Do
I need to implement my own analyzer?


Regards,
Jerome


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multi-value fields in Lucene 4.1

2013-03-22 Thread Jack Krupansky
I don't think there is a way of identifying which of the values of a 
multivalued field matched. But... I haven't checked the code to be 
absolutely certain whether there isn't some expert way.


Also, realize that multiple values could match, such as if you queried for 
"B*".


-- Jack Krupansky

-Original Message- 
From: Chris Bamford

Sent: Friday, March 22, 2013 5:57 AM
To: java-user@lucene.apache.org
Subject: Multi-value fields in Lucene 4.1

Hi,

If I index several similar values in a multivalued field (e.g. many authors 
to one book), is there any way to know which of these matched during a 
query?

e.g.

 Book "The art of Stuff", with authors "Bob Thingummy" and "Belinda 
Bootstrap"


If we queried for +(author:Be*) and matched this document, is there a way of 
drilling down and identifying the specific sub-field that actually triggered 
the match ("Belinda Bootstrap") ?  I was wondering what the lowest 
granularity of matching actually is - document / field / sub-field ...


I am happy to index with term vectors and positions if it helps.

Thanks,

- Chris 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting documents from suggestions

2013-03-16 Thread Jack Krupansky
I don't have time to debug your code right now, but make sure that 
the analysis is consistent between index and query. For example, "Apache" 
vs. "apache".


-- Jack Krupansky

-Original Message- 
From: Bratislav Stojanovic

Sent: Saturday, March 16, 2013 7:29 AM
To: java-user@lucene.apache.org
Subject: Re: Getting documents from suggestions

Hey Jack,

I've tried MoreLikeThis, but it always returns me 0 hits. Here's the code,
it's very simple :

// test2
Index lucene = null;
try {
    lucene = new Index();
    MoreLikeThis mlt = new MoreLikeThis(lucene.reader);
    mlt.setAnalyzer(lucene.analyzer);
    Reader target = new StringReader("apache");
    Query query = mlt.like(target, "contents");
    TopDocs results = lucene.searcher.search(query, 10);
    ScoreDoc[] hits = results.scoreDocs;
    System.out.println("Total " + hits.length + " hits");
    for (int i = 0; i < hits.length; i++) {
        Document doc = lucene.searcher.doc(hits[i].doc);
        System.out.println("Hit " + i + " : " + doc.getField("id").stringValue());
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (lucene != null) lucene.close();
}

Here are my fields in the index :

...
Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
doc.add(pathField);

doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
// id so we can fetch metadata
doc.add(new LongField("id", id, Field.Store.YES));
// add default user (guest)
doc.add(new LongField("userid", -1L, Field.Store.YES));

doc.add(new TextField("contents", new BufferedReader(new
        InputStreamReader(fis, "UTF-8"))));
...

Do you have any clue what might be wrong? Btw, searching for "apache"
returns me 36 hits, so index is fine.

P.S. I even tried like(int) and passing Doc. Id. but same thing.

On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky 
wrote:



Could you give us some examples of what you expect? I mean, how is your
suggested set of documents any different from simply executing a query 
with

the list of suggested terms (using q.op=OR)?

Or, maybe you want something like MoreLikeThis?

-- Jack Krupansky

-Original Message- From: Bratislav Stojanovic
Sent: Thursday, March 14, 2013 5:36 PM
To: java-user@lucene.apache.org
Subject: Getting documents from suggestions


Hi all,

How can I filter suggestions based on some value from the indexed field?
I have a stored 'id' field in my index and I want to use that to examine
documents
where the suggestion was found, but how to get Document from suggestion?
SpellChecker class only returns array of strings.

What classes should I use? Please help.

Thanx in advance.

--
Bratislav Stojanovic, M.Sc.

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
Bratislav Stojanovic, M.Sc. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting documents from suggestions

2013-03-14 Thread Jack Krupansky

Let's refine this...

If a top suggestion is X, do you simply want to know a few of the documents 
which have the highest term frequency for X?


Or is there some other term-oriented metric you might propose?

-- Jack Krupansky

-Original Message- 
From: Bratislav Stojanovic

Sent: Thursday, March 14, 2013 6:14 PM
To: java-user@lucene.apache.org
Subject: Re: Getting documents from suggestions

Wow that was fast :)

I have implemented a simple search box with auto-suggestions, so whenever a
user types in something, an Ajax call is fired to the SuggestServlet and in
return 10 suggestions are shown. It's working fine with the SpellChecker
class, but I only get an array of Strings.

What I want is to get Lucene Document instances so I can use doc.get("id")
to filter those suggestions.
This field is my own field; it doesn't have anything to do with the default
Doc. Id field Lucene generates.

Here's an example: when I type "apache" I get suggestions like "apache.org",
"apache2", etc.
Now I want to have something like this :
Document doc = SomeClass.getDocFromSuggestion("apache.org");
if (doc.get("id") == ...) {
  //add suggestion into the result
} else {
  //do nothing.
}

Is MoreLikeThis designed for this?

On Thu, Mar 14, 2013 at 10:45 PM, Jack Krupansky 
wrote:



Could you give us some examples of what you expect? I mean, how is your
suggested set of documents any different from simply executing a query 
with

the list of suggested terms (using q.op=OR)?

Or, maybe you want something like MoreLikeThis?

-- Jack Krupansky

-Original Message- From: Bratislav Stojanovic
Sent: Thursday, March 14, 2013 5:36 PM
To: java-user@lucene.apache.org
Subject: Getting documents from suggestions


Hi all,

How can I filter suggestions based on some value from the indexed field?
I have a stored 'id' field in my index and I want to use that to examine
documents
where the suggestion was found, but how to get Document from suggestion?
SpellChecker class only returns array of strings.

What classes should I use? Please help.

Thanx in advance.

--
Bratislav Stojanovic, M.Sc.

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
Bratislav Stojanovic, M.Sc. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Getting documents from suggestions

2013-03-14 Thread Jack Krupansky
Could you give us some examples of what you expect? I mean, how is your 
suggested set of documents any different from simply executing a query with 
the list of suggested terms (using q.op=OR)?


Or, maybe you want something like MoreLikeThis?

-- Jack Krupansky

-Original Message- 
From: Bratislav Stojanovic

Sent: Thursday, March 14, 2013 5:36 PM
To: java-user@lucene.apache.org
Subject: Getting documents from suggestions

Hi all,

How can I filter suggestions based on some value from the indexed field?
I have a stored 'id' field in my index and I want to use that to examine
documents
where the suggestion was found, but how to get Document from suggestion?
SpellChecker class only returns array of strings.

What classes should I use? Please help.

Thanx in advance.

--
Bratislav Stojanovic, M.Sc. 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Boolean Query not working in Lucene 4.0

2013-02-26 Thread Jack Krupansky
Try detailing both your expected behavior and the actual behavior. Try 
providing an actual code snippet and actual index and query data. Is it 
failing for all types and titles or just for some?


-- Jack Krupansky

-Original Message- 
From: saisantoshi

Sent: Tuesday, February 26, 2013 6:52 PM
To: java-user@lucene.apache.org
Subject: Boolean Query not working in Lucene 4.0

The following query does not seem to work after we upgraded from 2.4 to 4.0:

*+type:sometype +title:sometitle**

Any ideas as to some of the places to look? Is the above query correct in
syntax? I'd appreciate it if you could advise on the above.

Thanks,
Sai.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boolean-Query-not-working-in-Lucene-4-0-tp4043246.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: possible bug on Spellchecker

2013-02-20 Thread Jack Krupansky

Any reason that you are not using the DirectSpellChecker?

See:
http://lucene.apache.org/core/4_0_0/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html

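It suggests straight off the main index's terms, so there is no separate
n-gram spell index to build or keep in sync. A minimal sketch ("contents"
is an assumed field name, "reader" an open IndexReader):

DirectSpellChecker spell = new DirectSpellChecker();
SuggestWord[] suggestions =
        spell.suggestSimilar(new Term("contents", "geroge"), 5, reader);
for (SuggestWord s : suggestions) {
    System.out.println(s.string + " freq=" + s.freq + " score=" + s.score);
}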
-- Jack Krupansky

-Original Message- 
From: Samuel García Martínez

Sent: Wednesday, February 20, 2013 3:34 PM
To: java-user@lucene.apache.org
Subject: possible bug on Spellchecker

Hi all,

Debugging the Solr spellchecker (IndexBasedSpellChecker, delegating to the
Lucene SpellChecker) behaviour, I think I found a bug when the input is a
6-letter word:
 - george
 - anthem
 - argued
 - fluent

Due to the getMin() and getMax() the grams indexed for these terms are 3
and 4. So, the fields would be something like this:
 - for "*george*"
- start3: "geo"
- start4: "geor"
- end3: "rge"
- end4: "orge"
- 3: "geo", "eor", "org", "rge"
- 4: "geor", "eorg", "orge"
 - for "*anthem*"
- start3: "ant"
- start4: "anth"
- end3: "tem"
- end4: "them"

The problem shows up when the user swaps the 3rd and 4th characters, misspelling
the word like this:
 - geroge
 - anhtem

The queries generated for this terms are: (SHOULD boolean queries)
- for "*geroge*"
 - start3: "ger"
 - start4: "gero"
 - end3: "oge"
 - end4: "roge"
 - 3: "ger", "ero", "rog", "oge"
 - 4: "gero", "erog", "roge"
- for "*anhtem*"
 - start3: "anh"
 - start4: "anht"
 - end3: "tem"
 - end4: "htem"
 - 3: "anh", "nht", "hte", "tem"
 - 4: "anht", "nhte", "htem"

So, as you can see, this kind of misspelling never matches the suitable
suggestions although the edit distance is 0.9556.

I think getMin(int l) and getMax(int l) should return 2 and 3,
respectively, for l==6. Debugging other values, I did not find any problem
with any kind of misspelling.

Any thoughts about this?

--
Un saludo,
Samuel García 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping and tokens

2013-02-19 Thread Jack Krupansky
Well, you don't need to "store" both copies since they will be the same. 
They both need to be "indexed" (string form for grouping, text form for 
keyword search), but only one needs to be "stored".


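A sketch of the two-field setup (the field names are arbitrary;
GroupingSearch lives in the lucene-grouping module):

Document doc = new Document();
String title = "Fifty shades of grey";
doc.add(new StringField("title_raw", title, Field.Store.NO)); // raw, group on this
doc.add(new TextField("title", title, Field.Store.YES));      // tokenized, search this

GroupingSearch grouping = new GroupingSearch("title_raw");
TopGroups<BytesRef> groups = grouping.search(searcher,
        new TermQuery(new Term("title", "shades")), 0, 10);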
-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Tuesday, February 19, 2013 1:07 AM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Tue, Feb 19, 2013 at 12:57 PM, Jack Krupansky 
wrote:



Oops, sorry for the "Solr" answer. In Lucene you need to simply index the
same value, once as a raw string and a second time as a tokenized text
field. Grouping would use the raw string version of the data.

Yeah, thanks Jack. Was just wondering if there would be a better alternative
to 2x storing. But I don't see any. Thanks again.


-- Jack Krupansky

-Original Message- From: Jack Krupansky
Sent: Monday, February 18, 2013 11:21 PM

To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

Okay, so, fields that would normally need to be tokenized must be stored as
both raw strings for grouping and tokenized text for keyword search. Simply
use copyField to copy from one to the other.

-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky
wrote:

 Please clarify exactly what you want to group by - give a specific example
that makes it clear what terms should affect grouping and which shouldn't.



Assume I am indexing library data. Say there are the following fields for
a particular book.
1. Published
2. Language
3. Genre
4. Author
5. Title
6. ISBN

At search time, the user can ask to group by any of the above
fields, which means none of them is supposed to be tokenized. So as I
said earlier, there is a book titled "Fifty shades of gray" and the
user searches for "shades". The result turns up in case the field is
tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear?

    In a nutshell, how do I use a groupby on a field that is also
tokenized?



-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens


Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have a use case where the user has the
freedom to group by any field at search time.

Now that only tokenized fields are eligible for grouping, this is
creating an issue with my search. Say for instance the book "*Fifty 
shades

of grey*" when tokenized and searched for "*shades*" turns up in the

result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer-Version4.1).

How do I go about this? Is indexing a tokenized and non-tokenized
version of the same field the only go? I am afraid its way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India.
+91 9626975420


-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping and tokens

2013-02-18 Thread Jack Krupansky
Oops, sorry for the "Solr" answer. In Lucene you need to simply index the 
same value, once as a raw string and a second time as a tokenized text 
field. Grouping would use the raw string version of the data.


-- Jack Krupansky

-Original Message-
From: Jack Krupansky

Sent: Monday, February 18, 2013 11:21 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

Okay, so, fields that would normally need to be tokenized must be stored as
both raw strings for grouping and tokenized text for keyword search. Simply
use copyField to copy from one to the other.

-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Monday, February 18, 2013 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky
wrote:


Please clarify exactly what you want to group by - give a specific example
that makes it clear what terms should affect grouping and which shouldn't.



Assume I am indexing library data. Say there are the following fields for
a particular book.
1. Published
2. Language
3. Genre
4. Author
5. Title
6. ISBN

At search time, the user can ask to group by any of the above
fields, which means none of them can be tokenized. So as I
had told earlier, there is a book titled "Fifty shades of gray" and the
user searches for "shades". The result turns up in case the field is
tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear?

In a nutshell, how do I use a groupby on a field that is also
tokenized?



-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens


Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have a use case where the user has the
freedom to group by any field during search time.

Now that fields eligible for grouping cannot be tokenized, this is
creating an issue with my search. Say for instance the book "*Fifty shades
of grey*" when tokenized and searched for "*shades*" turns up in the
result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer, Version 4.1).

How do I go about this? Is indexing a tokenized and a non-tokenized
version of the same field the only way to go? I am afraid it's way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India.
+91 9626975420


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping and tokens

2013-02-18 Thread Jack Krupansky
Okay, so, fields that would normally need to be tokenized must be stored as 
both raw strings for grouping and tokenized text for keyword search. Simply 
use copyField to copy from one to the other.


-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Monday, February 18, 2013 11:13 PM
To: java-user@lucene.apache.org
Subject: Re: Grouping and tokens

On Mon, Feb 18, 2013 at 9:47 PM, Jack Krupansky 
wrote:



Please clarify exactly what you want to group by - give a specific example
that makes it clear what terms should affect grouping and which shouldn't.



Assume I am indexing library data. Say there are the following fields for
a particular book.
1. Published
2. Language
3. Genre
4. Author
5. Title
6. ISBN

At search time, the user can ask to group by any of the above
fields, which means none of them can be tokenized. So as I
had told earlier, there is a book titled "Fifty shades of gray" and the
user searches for "shades". The result turns up in case the field is
tokenized. But here it doesn't, since it isn't tokenized. Hope I am clear?

In a nutshell, how do I use a groupby on a field that is also
tokenized?



-- Jack Krupansky

-Original Message- From: Ramprakash Ramamoorthy
Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens


Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have a use case where the user has the
freedom to group by any field during search time.

Now that fields eligible for grouping cannot be tokenized, this is
creating an issue with my search. Say for instance the book "*Fifty shades
of grey*" when tokenized and searched for "*shades*" turns up in the
result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer, Version 4.1).

How do I go about this? Is indexing a tokenized and a non-tokenized
version of the same field the only way to go? I am afraid it's way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India.
+91 9626975420 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping and tokens

2013-02-18 Thread Jack Krupansky
Please clarify exactly what you want to group by - give a specific example 
that makes it clear what terms should affect grouping and which shouldn't.


-- Jack Krupansky

-Original Message- 
From: Ramprakash Ramamoorthy

Sent: Monday, February 18, 2013 6:12 AM
To: java-user@lucene.apache.org
Subject: Grouping and tokens

Hello all,

From the grouping javadoc, I read that fields that are supposed to be
grouped should not be tokenized. I have a use case where the user has the
freedom to group by any field during search time.

Now that fields eligible for grouping cannot be tokenized, this is
creating an issue with my search. Say for instance the book "*Fifty shades
of grey*" when tokenized and searched for "*shades*" turns up in the
result. However this is not the case when I have it as a non-tokenized
field (using StandardAnalyzer, Version 4.1).

How do I go about this? Is indexing a tokenized and a non-tokenized
version of the same field the only way to go? I am afraid it's way too costly!
Thanks in advance for your valuable inputs.

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
India,
+91 9626975420 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: fuzzy queries

2013-02-09 Thread Jack Krupansky

You probably are not getting this document returned:

   list.add("strfffing_ m atcbbhing");

because... both terms have an edit distance greater than two.

All the other documents have one or the other or both terms with an editing 
distance of 2 or less.


Your query is essentially: Match a document if EITHER term matches. So, if 
NEITHER matches (within an editing distance of 2), the document is not a 
match.
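
Spelled out, the parsed query is roughly equivalent to this hand-built one (a sketch, Lucene 3.x API, with the field name taken from the code below):

    BooleanQuery bq = new BooleanQuery();
    bq.add(new FuzzyQuery(new Term("Name", "string")), BooleanClause.Occur.SHOULD);
    bq.add(new FuzzyQuery(new Term("Name", "matching")), BooleanClause.Occur.SHOULD);
    // A document matches if at least one SHOULD clause matches.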


-- Jack Krupansky

-Original Message- 
From: Pierre Antoine DuBoDeNa

Sent: Saturday, February 09, 2013 12:52 PM
To: java-user@lucene.apache.org
Subject: Re: fuzzy queries

With a query like string~ matching~ (without specifying a threshold) I get 14
results back.

Can it be a problem with the analyzers?

Here is the code:

private File indexDir = new File("/a-directory-here");

private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
analyzer);

public static void main(String[] args) throws Exception {

IndexProfiles Indexer = new IndexProfiles();

IndexWriter w = Indexer.CreateIndex();

ArrayList list = new ArrayList();

list.add("string matching");

list.add("string123 matching");

list.add("string matching123");

list.add("string123 matching123");

list.add("str4ing match2ing");

list.add("1string 2matching");

list.add("str_ing ma_tching");

list.add("string_matching");

list.add("strang mutching");

list.add("strrring maatchinng");

list.add("strfffing_ m atcbbhing");

list.add("str2ing__mat3ching");

list.add("string_m atching");

list.add("string matching another token");

list.add("strasding matc4hing ano23ther tok3en");

list.add("str4ing maaatching_another 2t oken");


 for (String companyname:list)

{

Indexer.addSingleField(w, companyname);

}


int numDocs = w.numDocs();

   System.out.println("# of Docs in Index: " + numDocs);

   w.close();


   DoIndexQuery("string~ matching~");

 }

public static void DoIndexQuery(String query) throws IOException,
ParseException {

IndexProfiles Indexer = new IndexProfiles();

   IndexReader reader = Indexer.LoadIndex();


Indexer.SearchIndex(reader, query, 50);



   reader.close();

}


public IndexWriter CreateIndex() throws IOException {



Directory index = FSDirectory.open(indexDir);

IndexWriter w = new IndexWriter(index, config);

return w;



}


public HashMap SearchIndex(IndexReader w, String query, int topk)
throws IOException, ParseException {



 Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);



IndexSearcher searcher = new IndexSearcher(w);

TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);

searcher.search(q, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;



System.out.println("Found " + hits.length + " hits.");

HashMap map = new HashMap();

for (int i = 0; i < hits.length; i++) { ... }


Can you reduce your test case to indexing one document/field and
running a single FuzzyQuery (you seem to be running two at once,
OR'ing the results)?

And show the complete standalone source code (eg what is topk?) so we
can see how you are indexing / building the Query / searching.

The default minSim is 0.5.

Note that 0.01 is not useful in practice: it (should) match nearly all
terms.  But I agree it's odd one term is not matching.

Mike McCandless

http://blog.mikemccandless.com

On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa
 wrote:
>>
>> Hello,
>>
>> I use Lucene 3.6 and I try to use fuzzy queries so that I can match
>> much more results.
>>
>> I am adding for example these strings:
>>
>>  list.add("string matching");
>>
>> list.add("string123 matching");
>>
>> list.add("string matching123");
>>
>> list.add("string123 matching123");
>>
>> list.add("str4ing match2ing");
>>
>> list.add("1string 2matching");
>>
>> list.add("str_ing ma_tching");
>>
>> list.add("string_matching");
>>
>> list.add("strang mutching");
>>
>> list.add("strrring maatchinng");
>>
>> list.add("strfffing_ m atcbbhing");
>>
>> list.add("str2ing__mat3ching");
>>
>> list.add("string_m atching");
>>
>> list.add("string matching another token");
>>
>> list.add("strasding matc4hing ano23ther tok3en");
>>
>> list.add("str4ing maaatching_another 2t oken");
>>
>>
>>
>> then I do a query:
>>
>> "string~0.01 matching~0.01"

Re: Wildcard in a text field

2013-02-08 Thread Jack Krupansky
Ah, okay... some people call that "prospective" search. In any case, there 
is no direct Lucene support that I know of.


There are some references here:
http://lucene.apache.org/core/4_0_0/memory/org/apache/lucene/index/memory/MemoryIndex.html
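
A minimal sketch of that "prospective" approach (Lucene 4.0 API; the field name and values are illustrative only):

    // Treat the stored tag pattern as a query and the incoming text as a document.
    MemoryIndex mi = new MemoryIndex();
    mi.addField("tags", "buddhism", new StandardAnalyzer(Version.LUCENE_40));
    Query storedPattern = new WildcardQuery(new Term("tags", "buddh*"));
    float score = mi.search(storedPattern); // greater than 0.0f means it matches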

-- Jack Krupansky

-Original Message- 
From: Nicolas Roduit

Sent: Friday, February 08, 2013 10:14 AM
To: java-user@lucene.apache.org
Subject: Re: Re: Wildcard in a text field

For instance, I have a list of tags related to a text. Each text with
its list of tags are put in a document and indexed by Lucene. If we
consider that a tag is "buddh*" and I would like to make a query (e.g.
"buddha" or "buddhism" or "buddhist") and find the document that contain
"buddh*".

Thanks,


On 08.02.13 13:35, Jack Krupansky wrote:
That description is too vague. Could you provide a couple of examples of 
index text and queries and what you expect those queries to match.


If you simply want to query for "*" and "?" in "string" fields, escape 
them with a backslash. But if you want to escape them in "text" fields, be 
sure to use an analyzer that preserves them since they generally will be 
treated as spaces.


-- Jack Krupansky

-Original Message- From: Nicolas Roduit
Sent: Friday, February 08, 2013 2:49 AM
To: java-user@lucene.apache.org
Subject: Wildcard in a text field

I'm looking for a way of making a query on words which contain wildcards
(* or ?). In general, we use wildcards in query, not in the text. I
haven't find anything in Lucene to build that.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Wildcard in a text field

2013-02-08 Thread Jack Krupansky
That description is too vague. Could you provide a couple of examples of 
index text and queries and what you expect those queries to match.


If you simply want to query for "*" and "?" in "string" fields, escape them 
with a backslash. But if you want to escape them in "text" fields, be sure 
to use an analyzer that preserves them since they generally will be treated 
as spaces.


-- Jack Krupansky

-Original Message- 
From: Nicolas Roduit

Sent: Friday, February 08, 2013 2:49 AM
To: java-user@lucene.apache.org
Subject: Wildcard in a text field

I'm looking for a way of making a query on words which contain wildcards
(* or ?). In general, we use wildcards in query, not in the text. I
haven't find anything in Lucene to build that.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene vs Glimpse

2013-02-04 Thread Jack Krupansky
Generally, all of your example queries should work fine with Lucene, 
provided that you carefully choose your analyzer, or even use the 
StandardAnalyzer. The special characters like underscore and dot generally 
get treated as spaces and the resulting sequence of terms would match as a 
phrase. It won't be a 100% solution, but it should do reasonably well.
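
If in doubt, you can print what a given analyzer does to such input (a self-contained sketch, Lucene 4.x; the field name is arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
            TokenStream ts = analyzer.tokenStream("body", new StringReader("import foo.bar.baz;"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // one line per term to be indexed
            }
            ts.end();
            ts.close();
        }
    }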


Is there a query that was failing to match reasonably for you?

-- Jack Krupansky

-Original Message- 
From: Mathias Dahl

Sent: Monday, February 04, 2013 1:01 PM
To: java-user@lucene.apache.org
Subject: Lucene vs Glimpse

Hi,

I have hacked together a small web front end to the Glimpse text
indexing engine (see http://webglimpse.net/ for information). I am
very happy with how Glimpse indexes and searches data. If I understand
it correctly it uses a combination of an index and searching directly
in the files themselves as grep or other tools. The problem is that I
discovered it is not open source and now that I want to extend the use
from private to company wide I will run into license problems/costs.

So, I decided to try out Lucene. I tried the examples and changed them
a bit to use another analyzer. But when I started to think about it I
realized that I will not be able to build something like Glimpse. At
least not easily.

Why? I will try to explain:

As stated above, Glimpse uses a combination of index and in-file
search. This makes it very powerful in the sense that I can get hits
for things that are not necessarily indexed as terms. Let's say
I have a file with this content:

...
import foo.bar.baz;
...

With Glimpse, and without telling it how to index the content I can
find the above file using a search string like "foo" or "bar" but
also, and this is important, using foo.bar.baz.

Another example:

We have a lot of PL/SQL source code, and often you can find code like this:

...
My_Nice_API.Some_Method
...

Here too, Glimpse is almost magic since it combines index and normal
search. I can find the file above using "My_Nice_API" or
"My_Nice_API.Some_Method".

In a sense I can have the cake and eat it too.

If I want to do similar "free" search stuff with Lucene I think I have
to create analyzers for the different kind of source code files, with
fields for this and that. Quite an undertaking.

Does anyone understand my point here and am I correct in that it would
be hard to implement something as "free" as with Glimpse? I am not
trying to criticize, just to understand how Lucene (and Glimpse) work.

Oh, yes, Glimpse has one big drawback: it only supports search strings
up to 32 characters.

Thanks!

/Mathias

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to find related words ?

2013-01-31 Thread Jack Krupansky
Oh, so you wanted "similar" words! You should have said so... your inquiry 
said you were looking for "related" words. So, which is it? More 
specifically, what exactly are you looking for, in terms of the semantics? 
In any case, "find similar" (MoreLikeThis) is about the best you can do out 
of the box.


-- Jack Krupansky

-Original Message- 
From: Andrew Gilmartin

Sent: Thursday, January 31, 2013 9:04 AM
To: java-user@lucene.apache.org
Subject: Re: How to find related words ?

wgggfiy wrote:

Hmm, it seems nice, but I'm puzzled by you and Andrew Gilmartin above;
what's the difference between you guys?


The difference is that similar documents do not give you similar terms. 
Similar documents can show a correlation of terms -- i.e., wherever Lucene is 
mentioned so are Solr and Hadoop -- but in no way does this mean that the 
terms are similar. Accumulating similar and/or synonymous terms is a manual 
process. I am sure there are text mining tools/algorithms that make 
discoveries, but I do not know about these. (I am a journeyman programmer 
not a researcher.) If anyone does know about them, please share with this 
list.


-- Andrew

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to find related words ?

2013-01-30 Thread Jack Krupansky

Take a look at MoreLikeThisQuery:
http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThisQuery.html

And MoreLikeThis itself:
http://lucene.apache.org/core/4_1_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html

So, the idea is to search for documents using your keyword(s) and ask Lucene to 
extract relevant terms from the top document(s).
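
A sketch (Lucene 4.x; "dir", the field name, and "topDocId" are placeholders):

    IndexReader reader = DirectoryReader.open(dir);
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] {"body"});
    mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_40)); // needed without term vectors
    mlt.setMinTermFreq(1);
    mlt.setMinDocFreq(1);
    // topDocId would be the doc id of a top hit for the query "Lucene".
    String[] related = mlt.retrieveInterestingTerms(topDocId);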


-- Jack Krupansky

-Original Message- 
From: wgggfiy

Sent: Wednesday, January 30, 2013 12:27 PM
To: java-user@lucene.apache.org
Subject: How to find related words ?

In short,
you put in a term like "Lucene",
and the ideal output would be "solr", "index", "full-text search", and so
on.
How do I make it find the related words? Thanks.
My idea is to use FuzzyQuery, or MoreLikeThis, or to calculate the score with
all the terms and then sort.
Any ideas?



-
--
Email: wuqiu.m...@qq.com
--
http://lucene.472066.n3.nabble.com/How-to-find-related-words-tp4037462.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-29 Thread Jack Krupansky
I'm sorry, but for anybody to help you here, you really need to be able to 
provide a concise test case, like 10-20 lines of code, completely 
self-contained. If you think you need a million documents to repro what you 
claimed was a simple scenario, then you leave me very, very confused - and 
unable to help you any further.


-- Jack Krupansky

-Original Message- 
From: George Kelvin

Sent: Tuesday, January 29, 2013 2:43 PM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

The problematic query is "scar"+"wads".

There are several (more than 10) documents in the data with the content
"star wars", so I think that query should be able to find all these
documents.

I was trying to provide a minimal test case, but I couldn't reduce the size
of data showing the failure.

The size of the minimal data showing the failure I got so far is around 2
million.

However, I found a suspicious document with content "scor". If I remove it
from the 2 million documents data, that query can find all the "star wars"
documents. If I add it back, then the query can't find any.

I tried to reduce the size of the data to 1 million further and add that
"scor" document, but now the query can still find all the "star wars"
documents.

Is it possible that Lucene somehow fails to find all the valid terms within
the edit distance?

Thanks!

George


On Tue, Jan 29, 2013 at 10:02 AM, Jack Krupansky 
wrote:



I also noticed that you have "MUST" for your full string of fuzzy terms -
that means every one of them must appear in an indexed document to be
matched. Is it possible that maybe even one term was not in the same
indexed document?

Try to provide a complete example that shows the input data and the query
- all the literals. In other words, construct a minimal test case that
shows the failure.


-- Jack Krupansky

-Original Message- From: George Kelvin
Sent: Tuesday, January 29, 2013 12:28 PM

To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

ed is set to 1 here and I have lowercased all the data and queries.

Regarding the indexed data factor you mentioned, can you elaborate more?

Thanks!

George


On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky wrote:

 That depends on the value of "ed", and the indexed data.


Another factor to take into consideration is that a case change ("Star"
vs. "star") also counts as an edit.

-- Jack Krupansky

-Original Message- From: George Kelvin
Sent: Tuesday, January 29, 2013 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x


Hi Jack,

Thanks for your reply!

I don't think I passed the prefixLength parameter in.

Here is the code I used to build the FuzzyQuery:

   String[] words = str.split("\\+");
   BooleanQuery query = new BooleanQuery();

   for (int i = 0; i < words.length; i++) { ... }

To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-29 Thread Jack Krupansky
I also noticed that you have "MUST" for your full string of fuzzy terms - 
that means every one of them must appear in an indexed document to be 
matched. Is it possible that maybe even one term was not in the same indexed 
document?


Try to provide a complete example that shows the input data and the query - 
all the literals. In other words, construct a minimal test case that shows 
the failure.


-- Jack Krupansky

-Original Message- 
From: George Kelvin

Sent: Tuesday, January 29, 2013 12:28 PM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

ed is set to 1 here and I have lowercased all the data and queries.

Regarding the indexed data factor you mentioned, can you elaborate more?

Thanks!

George


On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky 
wrote:



That depends on the value of "ed", and the indexed data.

Another factor to take into consideration is that a case change ("Star"
vs. "star") also counts as an edit.

-- Jack Krupansky

-Original Message- From: George Kelvin
Sent: Tuesday, January 29, 2013 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x


Hi Jack,

Thanks for your reply!

I don't think I passed the prefixLength parameter in.

Here is the code I used to build the FuzzyQuery:

   String[] words = str.split("\\+");
   BooleanQuery query = new BooleanQuery();

   for (int i = 0; i < words.length; i++) { ... }

-
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-29 Thread Jack Krupansky

That depends on the value of "ed", and the indexed data.

Another factor to take into consideration is that a case change ("Star" vs. 
"star") also counts as an edit.


-- Jack Krupansky

-Original Message- 
From: George Kelvin

Sent: Tuesday, January 29, 2013 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

Thanks for your reply!

I don't think I passed the prefixLength parameter in.

Here is the code I used to build the FuzzyQuery:

   String[] words = str.split("\\+");
   BooleanQuery query = new BooleanQuery();

   for (int i = 0; i < words.length; i++) { ... }

George



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Questions about FuzzyQuery in Lucene 4.x

2013-01-28 Thread Jack Krupansky
Let's see your code that calls FuzzyQuery . If you happen to pass a 
prefixLength (3rd parameter) of 3 or more, then "ster" would not match 
"star" (but prefixLength of 2 would match).


-- Jack Krupansky

-Original Message- 
From: George Kelvin

Sent: Monday, January 28, 2013 5:31 PM
To: java-user@lucene.apache.org
Subject: Questions about FuzzyQuery in Lucene 4.x

Hi All,

I’m working on several projects requiring powerful search features. I’ve
been waiting for Lucene 4 since I read Michael McCandless's blog post about
Lucene’s new FuzzyQuery and finally I got the chance to test it. The
improvement on the fuzzy search is really impressive!

However, I’ve encountered a problem today:

In my data, there are some records with keywords “star” and “wars”. But
when I issued a fuzzy query with two keywords “ster” and “wats”, the engine
failed to find the records.

I’m wondering if you can provide any inputs on that. Maybe I’m not doing
fuzzy search in the right way. But all my other fuzzy queries with single
keyword and with longer double keywords worked perfectly.

Another issue is that I’m also exploring the possibility to do
wildcard+fuzzy search using Lucene. I couldn’t find any related document
for this on Lucene's website, but I found a stackoverfow thread talking
about this.

http://stackoverflow.com/questions/2631206/lucene-query-bla-match-words-that-start-with-something-fuzzy-how

I tried the way suggested by the second answer there, and it worked.
However the scoring is strange: all results were assigned with the exactly
same score. Is there anything I can do to get the scoring right?

Can you tell me what’s the best way to do wildcard fuzzy search?

Any help will be appreciated!

Thanks,

George 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
Re-read my last message - and then take a look at that Solr source code, 
which will give you an idea how to use Tika, even though you are using 
Lucene only. If you have specific questions, please be specific.


To answer your latest question, yes, Tika is good enough. Solr 
/update/extract uses it, and Solr is based on Lucene.


-- Jack Krupansky

-Original Message- 
From: saisantoshi

Sent: Sunday, January 27, 2013 2:09 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for 
indexing the actual content


We are not using Solr and using just Lucene core 4.0 engine. I am trying to
see if we can use tika library to extract textual information from
pdf/word/excel documents. I am mainly interested in reading the contents
inside the documents and index using lucene. My question here is , is tika
framework good enough or is there any other better library. Any
issues/experiences in using the tika framework.

Thanks,
Sai.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Readers-for-extracting-textual-info-from-pd-doc-excel-for-indexing-the-actual-content-tp4036379p4036557.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Readers for extracting textual info from pd/doc/excel for indexing the actual content

2013-01-27 Thread Jack Krupansky
You may be able to use Tika directly without needing to choose the specific 
classes, although the latter may give you the specific data you need without 
the extra overhead.


You could take a look at the Solr Extracting Request Handler source for an 
example:

http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_1_0/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/

Basically, Tika extracts a bunch of "metadata" and then you will have to add 
selected metadata to your Lucene documents. "content" is the main document 
body text.
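
A minimal sketch of that flow (Tika 1.x plus Lucene 4.x; the Lucene field name is up to you):

    // Extract the body text from any supported file type and index it.
    static Document toLuceneDoc(InputStream in) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no character limit
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        Document doc = new Document();
        doc.add(new TextField("content", handler.toString(), Field.Store.NO));
        return doc;
    }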


You could try Solr itself to see how it works:
http://wiki.apache.org/solr/ExtractingRequestHandler

-- Jack Krupansky

-Original Message- 
From: Adrien Grand

Sent: Sunday, January 27, 2013 12:53 PM
To: java-user@lucene.apache.org
Subject: Re: Readers for extracting textual info from pd/doc/excel for 
indexing the actual content


Have you tried using the PDFParser [1] and the OfficeParser [2]
classes from Tika?

This question seems to be more appropriate for the Tika user mailing list 
[3]?


[1] 
http://tika.apache.org/1.3/api/org/apache/tika/parser/pdf/PDFParser.html#parse(java.io.InputStream,

org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata,
org.apache.tika.parser.ParseContext)
[2] 
http://tika.apache.org/1.3/api/org/apache/tika/parser/microsoft/OfficeParser.html#parse(java.io.InputStream,

org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata,
org.apache.tika.parser.ParseContext)
[3] http://tika.apache.org/mail-lists.html

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing multiple fields with one document position

2013-01-21 Thread Jack Krupansky
Send the same input text to two different analyzers for two separate fields. 
The first analyzer emits only the first attribute. The second analyzer emits 
only the second attribute. The document position in one will correspond to 
the document position in the other.
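
A sketch of the wiring (Lucene 4.x; "dir" is assumed, and the two attribute analyzers are placeholders for whatever you write):

    Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
    perField.put("pos", new PosOnlyAnalyzer());       // hypothetical: emits only the first attribute
    perField.put("number", new NumberOnlyAnalyzer()); // hypothetical: emits only the second attribute
    Analyzer wrapper = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_40), perField);
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_40, wrapper));

    Document doc = new Document();
    String entry = "cat"; // the same raw input text goes to both fields
    doc.add(new TextField("pos", entry, Field.Store.NO));
    doc.add(new TextField("number", entry, Field.Store.NO));
    writer.addDocument(doc);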


-- Jack Krupansky

-Original Message- 
From: Igor Shalyminov

Sent: Monday, January 21, 2013 3:04 AM
To: java-user@lucene.apache.org
Subject: Indexing multiple fields with one document position

Hello!

When indexing text with position data, one just adds field do a document in 
the form of its name and value, and the indexer assigns it unique position 
in the index.

I wonder, if I have an entry with two attributes, say:

cat,

How do I store in the index two fields, "pos" and "number" with its values, 
pointing to the same position in the document?


--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SpanNearQuery with two boundaries

2013-01-18 Thread Jack Krupansky

+1

I think that accurately states the semantics of the operation you want.

-- Jack Krupansky

-Original Message- 
From: Alan Woodward

Sent: Friday, January 18, 2013 1:08 PM
To: java-user@lucene.apache.org
Subject: Re: SpanNearQuery with two boundaries

Hi Igor,

You could try wrapping the two cases in a SpanNotQuery:
SpanNot(SpanNear(runs, cat, 10), SpanNear(runs, cat, 3))

That should return documents that have runs within 10 positions of cat, as 
long as they don't overlap with runs within 3 positions of cat.
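
In code that would be something like (a sketch, 4.0 API, using the field name from the question):

    SpanQuery runs = new SpanTermQuery(new Term("word", "runs"));
    SpanQuery cat  = new SpanTermQuery(new Term("word", "cat"));
    SpanQuery within10 = new SpanNearQuery(new SpanQuery[] {runs, cat}, 10, false);
    SpanQuery within3  = new SpanNearQuery(new SpanQuery[] {runs, cat}, 3, false);
    Query minMax = new SpanNotQuery(within10, within3); // near within 10 but not within 3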


Alan Woodward
www.flax.co.uk


On 18 Jan 2013, at 16:13, Igor Shalyminov wrote:


Hello!

I want to perform search queries like this one:
word:"dog" \1 word:"runs" (\3 \10) word:"cat"

It is thus something like SpanNearQuery, but with two boundaries - minimum 
and maximum distance between the terms (which in the \1-case would be 
equal).
Syntax (as above, fictional:) itself doesn't matter, I just want to know 
if one is able to build this type of query based on existing (Lucene 
4.0.0) query classes.


--
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Jack Krupansky
Currently there isn't. SpanNearQuery can take only other SpanQuery objects, 
which includes other spans, span terms, and span wrapped multi-term queries 
(e.g., wildcard, fuzzy query), but not Boolean queries.


But it does sound like a good feature request.

There is SpanNotQuery, so you can exclude terms from a span.
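
For the concrete example given below ('apple AND NOT computer' near 'monkey'), a hand-built equivalent might look like this (a sketch, 4.x API; the field name is made up):

    SpanQuery apple  = new SpanTermQuery(new Term("body", "apple"));
    SpanQuery monkey = new SpanTermQuery(new Term("body", "monkey"));
    SpanQuery near = new SpanNearQuery(new SpanQuery[] {apple, monkey}, 10, false);

    BooleanQuery q = new BooleanQuery();
    q.add(near, BooleanClause.Occur.MUST);
    q.add(new TermQuery(new Term("body", "computer")), BooleanClause.Occur.MUST_NOT);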

-- Jack Krupansky

-Original Message- 
From: Michel Conrad

Sent: Thursday, January 17, 2013 12:14 PM
To: java-user@lucene.apache.org
Subject: Re: Combine two BooleanQueries by a SpanNearQuery.

The problem I would like to solve is to have two queries that I will
get from the query parser (this could include wildcardqueries and
phrasequeries).
Both of these queries would have to match the document and as an
additional restriction I would like to add that a matching term from
the first query
is near a matching term from the second query.

So that you can search for instance for matching documents with
in your first query 'apple AND NOT computer'
and in your second 'monkey'
with a slop of 10 between the two queries, then it would be equivalent
to '"apple monkey"~10 AND NOT computer'.

I was wondering if there is a method to combine more complicated
queries in a similar way. (Some kind of generic solution)

Thanks for your help,

Michel

On Thu, Jan 17, 2013 at 5:14 PM, Jack Krupansky  
wrote:

You need to express the "boolean" query solely in terms of SpanOrQuery and
SpanNearQuery. If you can't, ... then it probably can't be done, but you
should be able to.

How about starting with a plain English description of the problem you are
trying to solve?

-- Jack Krupansky

-Original Message- From: Michel Conrad
Sent: Thursday, January 17, 2013 11:01 AM
To: java-user@lucene.apache.org
Subject: Combine two BooleanQueries by a SpanNearQuery.


Hi,

I am looking to get a combination of multiple subqueries.

What I want to do is to have two queries which have to be near one to
another.

As an example:
Query1: (A AND (B OR C))
Query2: D

Then I want to use something like a SpanNearQuery to combine both (slop 
5):

Both would then have to match and D should be within slop 5 to A, B or C.

So my question is if there is a query that combines two BooleanQuery
trees into a SpanNearQuery.
It would have to take the terms that match Query 1 and the terms that
match Query 2, and look if there is a combination within the required
slop.

Can I rewrite the BooleanQuery after parsing the query as a
MultiTermQuery, than wrap these in SpanMultiTermQueryWrapper, which
can be combined by the SpanNearQuery?

Best regards,
Michel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Combine two BooleanQueries by a SpanNearQuery.

2013-01-17 Thread Jack Krupansky
You need to express the "boolean" query solely in terms of SpanOrQuery and 
SpanNearQuery. If you can't, ... then it probably can't be done, but you 
should be able to.


How about starting with a plain English description of the problem you are 
trying to solve?


-- Jack Krupansky

-Original Message- 
From: Michel Conrad

Sent: Thursday, January 17, 2013 11:01 AM
To: java-user@lucene.apache.org
Subject: Combine two BooleanQueries by a SpanNearQuery.

Hi,

I am looking to get a combination of multiple subqueries.

What I want to do is to have two queries which have to be near one to 
another.


As an example:
Query1: (A AND (B OR C))
Query2: D

Then I want to use something like a SpanNearQuery to combine both (slop 5):
Both would then have to match and D should be within slop 5 to A, B or C.

So my question is if there is a query that combines two BooleanQuery
trees into a SpanNearQuery.
It would have to take the terms that match Query 1 and the terms that
match Query 2, and look if there is a combination within the required
slop.

Can I rewrite the BooleanQuery after parsing the query as a
MultiTermQuery, than wrap these in SpanMultiTermQueryWrapper, which
can be combined by the SpanNearQuery?

Best regards,
Michel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene-MoreLikethis

2013-01-15 Thread Jack Krupansky
There are lots of parameters you can adjust, but the defaults essentially 
assume that you have a fairly large corpus and aren't interested in 
low-frequency terms.


So, try MoreLikeThis#setMinDocFreq. The default is 5. You don't have any 
terms in your example with a doc freq over 2.


Also, try setMinTermFreq. The default is 2. You don't have any terms with a 
term frequency above 1.
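
Applied to the code below, that would be (a sketch; the analyzer is also set because the field has no term vectors):

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_40));
    mlt.setFieldNames(new String[] {"title"});
    mlt.setMinDocFreq(1);  // default 5: drop terms appearing in fewer than 5 docs
    mlt.setMinTermFreq(1); // default 2: drop terms occurring fewer than 2 times
    Query query = mlt.like(1);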


-- Jack Krupansky

-Original Message- 
From: Thomas Keller

Sent: Tuesday, January 15, 2013 3:22 PM
To: java-user@lucene.apache.org
Subject: Lucene-MoreLikethis

Hey,

I have a question about "MoreLikeThis" in Lucene, Java. I built up an index 
and want to find similar documents. But I always get no results for my 
query, mlt.like(1) is always empty. Can anyone find my mistake? Here is an 
example. (I use Lucene 4.0)


public class HelloLucene {
 public static void main(String[] args) throws IOException, ParseException 
{


  StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
  Directory index = new RAMDirectory();
  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, 
analyzer);


   IndexWriter w = new IndexWriter(index, config);
   addDoc(w, "Lucene in Action", "193398817");
   addDoc(w, "Lucene for Dummies", "55320055Z");
   addDoc(w, "Managing Gigabytes", "55063554A");
   addDoc(w, "The Art of Computer Science", "9900333X");
   w.close();

   // search
   IndexReader reader = DirectoryReader.open(index);
   IndexSearcher searcher = new IndexSearcher(reader);

   MoreLikeThis mlt = new MoreLikeThis(reader);
   Query query = mlt.like(1);
   System.out.println(searcher.search(query, 5).totalHits);
 }

 private static void addDoc(IndexWriter w, String title, String isbn) 
throws IOException {

   Document doc = new Document();
   doc.add(new TextField("title", title, Field.Store.YES));

   // use a string field for isbn because we don't want it tokenized
   doc.add(new StringField("isbn", isbn, Field.Store.YES));
   w.addDocument(doc);
 }
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: FuzzyQuery in lucene 4.0

2013-01-09 Thread Jack Krupansky
FWIW, new FuzzyQuery(term, 2 ,0) is the same as new FuzzyQuery(term), given 
the current values of defaultMaxEdits (2) and defaultPrefixLength (0).


-- Jack Krupansky

-Original Message- 
From: Ian Lea

Sent: Wednesday, January 09, 2013 9:44 AM
To: java-user@lucene.apache.org
Subject: Re: FuzzyQuery in lucene 4.0

See the javadocs for FuzzyQuery to see what the parameters are.  I
can't tell you what the comment means.  Possible values to try maybe?


--
Ian.


On Wed, Jan 9, 2013 at 2:34 PM, algebra  wrote:

It's true Ian, the code is good.

The only thing that I don't understand is this line:

Query query = new FuzzyQuery(term, 2, 0); // 0-2

What does 0 to 2 mean?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/FuzzyQuery-in-lucene-4-0-tp4031871p4031879.html

Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Differences in MLT Query Terms Question

2013-01-08 Thread Jack Krupansky
The term "arv" is on the first list, but not the second. Maybe it's document 
frequency fell below the setting for minimum document frequency on the 
second run.


Or, maybe the minimum word length was set to 4 or more on the second run.

Are you using MoreLikeThisQuery or directly using MoreLikeThis?

Or, possibly "arv" appears later in a document on the second run, after the 
number of tokens specified by maxNumTokensParsed.


-- Jack Krupansky

-Original Message- 
From: Peter Lavin

Sent: Tuesday, January 08, 2013 1:46 PM
To: java-user@lucene.apache.org
Subject: Differences in MLT Query Terms Question


Dear Users,

I am running some simple experiments with Lucene and am seeing something
I don't understand.

I have 16 text files on 4 different topics, ranging in size from 50-900
KB.  When I index all 16 of these and run an MLT query based on one of
the indexed documents, I get an expected result (i.e. similar topics are
found).

When I reduce the number of text files to 4 and index them (having taken
care to overwrite the previous index files), and then run the same MLT
query (based on the same document from the index), I get slightly
different scores. I'm assuming this is because the IDF is now different,
as there are fewer documents.

For each run, I have set the max number of terms as...
mlt.setMaxQueryTerms(100)

However, when I compare the terms which get used for the MLT query on
the 16 document index and the 4 document index, they are slightly
different. I've printed, parsed and sorted them into two columns of a
CSV file. I've pasted a small part of it at the end of this email.

My Question(s)...
1) Can anybody explain why the set of terms used for the MLT query is
different when a file from an index of 16 documents versus 4 documents
is used?

2) Am I right in assuming that the reason for the slightly different scores
is the IDF, or could it be this slight difference in the sets of terms
used (or possibly both)?

regards,
Peter


--
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536



"about","about"
"affordable","affordable"
"agents","agents"
"aids","aids"
"architecture","architecture"
"arv","based"
"based","blog"
"blog","board"
"board","business"
"business","care"
"care","commemorates"
"commemorates","contacts"
"contacts","contributions"
"contributions","coordinating"
"coordinating","core"
"core","countries"
"countries","country"
"country","data"
"data","decisions"
"decisions","details"
"details","disbursements"
"disbursements","documents"
"documents","donors"

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: TokenFilter state question

2012-12-26 Thread Jack Krupansky


If you want the query parser to present a sequence of tokens to your 
analyzer as one unit, you need to enclose the sequence in quotes. Otherwise, 
the query parser will pass each of the terms to the analyzer as a single 
unit, with a reset for each, as you in fact report.


So, change:

   String querystr = "product:(Spring Framework Core) 
vendor:(SpringSource)";


to

   String querystr = "product:\"Spring Framework Core\" 
vendor:(SpringSource)";


-- Jack Krupansky

-Original Message- 
From: Jeremy Long

Sent: Wednesday, December 26, 2012 5:52 PM
To: java-user@lucene.apache.org
Subject: Re: TokenFilter state question

Actually, I had thought the same thing and had played around with the reset
method. However, in my example where I called the custom analyzer the
"reset" method was called after every token, not after the end of the
stream (which implies each token was treated as its own TokenStream?).

   String querystr = "product:(Spring Framework Core)
vendor:(SpringSource)";

   Map fieldAnalyzers = new HashMap();

   fieldAnalyzers.put("product", new
SearchFieldAnalyzer(Version.LUCENE_40));
   fieldAnalyzers.put("vendor", new
SearchFieldAnalyzer(Version.LUCENE_40));

   PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
   new StandardAnalyzer(Version.LUCENE_40), fieldAnalyzers);
   QueryParser parser = new QueryParser(Version.LUCENE_40, field1,
wrapper);

   Query q = parser.parse(querystr);

In the above example (from the code snippets in the original email), if I
were to add a reset method it would be called 4 times (yes, I have tried
this). Reset gets called for each token "Spring" "Framework" "Core" and
"SpringSource". Thus, if I reset my internal state I would not achieve the
goal of have "Spring Framework Core" result in the tokens "Spring
SpringFramework Framework FrameworkCore Core". My question is - why would
these be treated as separate token streams?

--Jeremy

On Wed, Dec 26, 2012 at 10:54 AM, Jack Krupansky 
wrote:



You need a "reset" method that calls the super reset to reset the parent
state and then reset your own state.

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html#reset()

You probably don't have one, so only the parent state gets reset.
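
Something like this (a sketch against the filter described below; "words" stands in for whatever internal buffer you keep):

    @Override
    public void reset() throws IOException {
        super.reset();       // resets the wrapped stream's state
        previousWord = null; // clear this filter's leftover state
        words.clear();       // hypothetical internal token buffer
    }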

-- Jack Krupansky

-Original Message- From: Jeremy Long
Sent: Wednesday, December 26, 2012 9:08 AM
To: java-user@lucene.apache.org
Subject: TokenFilter state question


Hello,

I'm still trying to figure out some of the nuances of Lucene and I have run
into a small issue. I have created my own custom analyzer which uses the
WhitespaceTokenizer and chains together the LowercaseFilter,
StopwordFilter, and my own custom filter (below). I am using this analyzer
when searching (i.e. it is the analyzer used in a QueryParser).

The custom analyzer's purpose is to add tokens by concatenating the previous
word with the current word. So that if you were given "Spring Framework
Core" the resulting tokens would be "Spring SpringFramework Framework
FrameworkCore Core". My problem is that when my query text is "Spring
Framework Core" I end up with left-over state in
my TokenPairConcatenatingFilter (the previousWord is a member field). So 
if

I end up re-using my query parser on a subsequent search for "Apache
Struts" I end up with the token stream of "CoreApache Apache ApacheStruts
Struts". The Initial "core" was left over state.

The left over state from the initial query appears to arise because in my
initial loop that collects all of the tokens from the underlying stream
only collects a single token. So the processing is - we collect the token
"spring", we write "spring" out to the stream and move it to the
previousWord. Next, we are at the end of the stream and we have no more
words in the list so the filter returns false. At this time, the filter is
called again and "Framework" is collected... repeat until end of tokens
from the query is reached; however, "Core" is left in the previousWord
field. The filter would work correctly with no state being left over if all
of the tokens were collected at the beginning (i.e. the first call to
incrementToken). Can anyone explain why all of the tokens would not be
collected and/or a work around so that when
QueryParser.parse("field:(**Spring Framework Core)") is called residual
state
is not left over in my token filter? I have two hack solutions - 1) don't
reuse the analyzer/QueryParser for subsequent queries or 2) build in a
reset mechanism to clear the previousWord field. I don't like either
solution and was hoping 

Re: Which token filter can combine 2 terms into 1?

2012-12-26 Thread Jack Krupansky
Ah! You're quoting full phrases. You weren't clear about that originally. 
Thanks for the clarification.


-- Jack Krupansky

-Original Message- 
From: Tom

Sent: Wednesday, December 26, 2012 5:54 PM
To: java-user@lucene.apache.org
Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 2:44 PM, Jack Krupansky 
wrote:



You still have the query parser's parsing before analysis to deal with, no
matter what magic you code in your analyzer.



Not quite.
"query parser's parsing" comes first, you are correct on that. But it is
irrelevant for splitting field values into search terms, because this part
of the whole process is done by an analyzer. Therefore, if you make sure
the correct analyzer is used, then the parsing and splitting into
individual search terms will be done by this analyzer, not by the query
parser.

Try it: Implement an analyzer with the SnippetFilter below. Start Luke and
make sure this analyzer is selected in "Analyzer to use for query parsing".
In the search expression, type in any length of text for example:

body:"word1 word2 word3"

and you will get the possibly combined Terms.

For example, let's say one snipped in your SnippetFilter is: "word2 word3"
you will get

Term 0: field=body text=word1
Term 1: field=body text=word2 word3

In this case, word2 and word3 will NOT be split.





-- Jack Krupansky

-Original Message- From: Tom
Sent: Friday, December 21, 2012 2:24 PM
To: java-user@lucene.apache.org

Subject: Re: Which token filter can combine 2 terms into 1?

On Fri, Dec 21, 2012 at 9:16 AM, Jack Krupansky wrote:

 And to be more specific, most query parsers will have already separated

the terms and will call the analyzer with only one term at a time, so no
term recombination is possible for those parsed terms, at query time.

 Most analyzers will do that, yes. But if Xi writes his own analyzer with

his own combiner filter, then he should also use this for query generation
and thus get the desired combinations / snippets there as well.

Xi, here is the recipe (a code skeleton follows the list):
- SnippetFilter extends TokenFilter
- SnippetFilter needs access to your lexicon: a data structure to store
your snippets. In the general case this is a tree, and going along a branch
will tell you whether a valid snippet has been built or if the snippet
could be longer. (Example: "internal revenue" can be one snippet but,
depending on the next token, a larger snippet of "internal revenue service"
could be built.)
- Logic of the SnippetFilter.incrementToken() goes something like this: You
need a loop which retrieves tokens from the input variable until the input
is empty. You store each retrieved token in a variable(s) x in
SnippetFilter. As long as you have a potential match against your lexicon,
you can continue in this loop. Once you realize that there is something
within x which can not possibly become a (longer) snippet, break out of the
loop and allow the consumer to retrieve it.
- make sure your analyzer inserts SnippetFilter at the correct spot in the
filter chain.
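
A bare-bones skeleton of the above (lexicon stubbed to a single fixed pair; offset and position bookkeeping omitted):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class SnippetFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private String pending; // token read ahead but not yet emitted

        public SnippetFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            String first = pending;
            pending = null;
            if (first == null) {
                if (!input.incrementToken()) return false;
                first = termAtt.toString();
            }
            if (input.incrementToken()) { // look ahead one token
                String second = termAtt.toString();
                if (isSnippet(first, second)) {
                    termAtt.setEmpty().append(first).append(' ').append(second);
                    return true; // emit the combined snippet as one token
                }
                pending = second; // not a snippet: hold it for the next call
            }
            termAtt.setEmpty().append(first);
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending = null;
        }

        // Stand-in for the lexicon tree described above.
        private boolean isSnippet(String a, String b) {
            return "word2".equals(a) && "word3".equals(b);
        }
    }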

Cheers
FiveMileTom







-- Jack Krupansky
-Original Message- From: Erick Erickson
Sent: Friday, December 21, 2012 8:27 AM
To: java-user
Subject: Re: Which token filter can combine 2 terms into 1?


If it's a fixed list and not excessively long, would synonyms work?

But if there's some kind of logic you need to apply, I don't think you're
going to find anything OOB.
The problem is that by the time a token filter gets called, they are
already split up, you'll probably
have to write a custom filter that manages that logic.

Best
Erick


On Fri, Dec 21, 2012 at 4:16 AM, Xi Shen  wrote:

Unfortunately, no... I am not combining every two terms into one. I am
combining a specific pair.

E.g. the Token Stream: t1 t2 t2a t3
should be rewritten into t1 t2t2a t3

But the TS: t1 t2 t3 t2a
should not be rewritten, and it is already correct


On Fri, Dec 21, 2012 at 5:00 PM, Alan Woodward <alan.woodward@romseysoftware.co.uk> wrote:

> Have a look at ShingleFilter:
>
> http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html

>
> On 21 Dec 2012, at 08:42, Xi Shen wrote:
>
> > I have to use the white space and word delimiter to process the 
> > input

> > first. I tried many combination, and it seems to me that it is
inevitable
> > the term will be split into two :(
> >
> > I think developing my own filter is the only resolution...but I just
> cannot
> > find a guide to help me unders
