Question about extending the query parser to support NumericField on Lucene 2.9.0

2009-10-22 Thread java8964 java8964

Hi, I have a problem getting the query parser to support NumericField.

My environment is like this:

Windows XP with 
C:\work\> java -version
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Client VM (build 11.0-b15, mixed mode, sharing)

I am using the Lucene 2.9.0 release.

I wrote my own query parser class to support numeric fields; here is a copy of
the overridden methods:

/**
 * Create a new range query for the query parser.
 * 
 * If the field is a numeric field, return a NumericRangeQuery;
 * otherwise, let the super class handle it.
 * 
 * @param fieldName The field name
 * @param part1 The lower bound
 * @param part2 The upper bound
 * @throws IllegalArgumentException if the field type is not supported
 * @throws NumberFormatException if the query data does not match the field type
 */
@Override
protected Query newRangeQuery(String fieldName, String part1, String part2,
        boolean inclusive)
{
    fieldName = fieldName.toLowerCase();
    if (LogUtil.getInstance().isDebugEnabled(DcQueryParser.class))
    {
        LogUtil.getInstance().debug(DcQueryParser.class,
                "Create a new range query for: " + fieldName);
    }

    mFieldNames.add(fieldName);
    IFieldDefinition fieldDef = mIndexDef.getFieldDefinition(fieldName);
    // Trim before stripping so substring(1) removes the '+', not a leading space.
    part1 = part1.trim();
    if (part1.startsWith("+"))
    {
        part1 = part1.substring(1);
    }
    part2 = part2.trim();
    if (part2.startsWith("+"))
    {
        part2 = part2.substring(1);
    }
    if (fieldDef != null && fieldDef.isNumericField())
    {
        if (fieldDef.getFieldType() == IFieldDefinition.FieldType.INT)
        {
            return NumericRangeQuery.newIntRange(fieldDef.getName(),
                    Integer.parseInt(part1), Integer.parseInt(part2),
                    inclusive, inclusive);
        }
        else if (fieldDef.getFieldType() == IFieldDefinition.FieldType.LONG)
        {
            return NumericRangeQuery.newLongRange(fieldDef.getName(),
                    Long.parseLong(part1), Long.parseLong(part2),
                    inclusive, inclusive);
        }
        else if (fieldDef.getFieldType() == IFieldDefinition.FieldType.FLOAT)
        {
            return NumericRangeQuery.newFloatRange(fieldDef.getName(),
                    Float.parseFloat(part1), Float.parseFloat(part2),
                    inclusive, inclusive);
        }
        else if (fieldDef.getFieldType() == IFieldDefinition.FieldType.DOUBLE)
        {
            return NumericRangeQuery.newDoubleRange(fieldDef.getName(),
                    Double.parseDouble(part1), Double.parseDouble(part2),
                    inclusive, inclusive);
        }
        else
        {
            throw new IllegalArgumentException("Unsupported numeric field type: "
                    + fieldDef.getFieldType().name());
        }
    }
    return super.newRangeQuery(fieldName, part1, part2, inclusive);
}

/**
 * Create a new term query for the query parser.
 * If the field is a numeric field, use the xxxToPrefixCoded encoding;
 * otherwise, let the super class handle it.
 * 
 * @param term The term object
 * @return The query object
 * @throws IllegalArgumentException if the field type is not supported
 * @throws NumberFormatException if the query data does not match the field type
 */
@Override
protected Query newTermQuery(Term term)
{
    String fieldName = term.field();
    if (LogUtil.getInstance().isDebugEnabled(DcQueryParser.class))
    {
        LogUtil.getInstance().debug(DcQueryParser.class,
                "Create a new term query for: " + fieldName);
    }

    mFieldNames.add(fieldName);
    IFieldDefinition fieldDef = mIndexDef.getFieldDefinition(fieldName);
    if (fieldDef != null && fieldDef.isNumericField())
    {
        String queryString = term.text().trim();
        if (queryString.startsWith("+"))
        {
            // Must reassign: String.substring() returns a new string.
            queryString = queryString.substring(1);
        }
        if (fieldDef.getFieldType() == IFieldDefinition.FieldType.INT)
        {
            return new TermQuery(new Term(term.field(),
                    NumericUtils.intToPrefixCoded(Integer.parseInt(queryString))));
        }
        else if (fieldDef.getFieldType() == IFieldDefinition.FieldType.LONG)
        {
            return new TermQuery(new Term(term.field(),
                    NumericUtils.longToPrefixCoded(Long.parseLong(queryString))));
        }
        else if (fieldDef.getFieldType() == IFieldDefinition.FieldType.FLOAT)
        {
            return new TermQuery(new Term(term.field(),
                    NumericUtils.floatToPrefixCoded(Float.parseFloat(queryString))));
        }
        else if (fieldDef.getFieldType() == IFieldDefinition.FieldType.DOUBLE)
        {
            return new TermQuery(new Term(term.field(),
                    NumericUtils.doubleToPrefixCoded(Double.parseDouble(queryString))));
        }
        else
        {
            throw new IllegalArgumentException("Unsupported numeric field type: "
                    + fieldDef.getFieldType().name());
        }
    }
    return super.newTermQuery(term);
}

What is the best way to handle the primary key case during lucene indexing

2009-11-16 Thread java8964 java8964

Hi, 

In our application, we allow the user to define a primary key in the document. 
We are using Lucene 2.9.
In this case, when we index the data coming from the client, if the metadata 
contains a primary key definition, 
we have to do a search/update for every row based on the primary key.

Here are our current problems:

1) If the metadata coming from the client defines a primary key (which can 
contain one or multiple fields), 
then for the data supplied by the client, we have to make sure that a later 
row overrides the previous row if they have the same primary key.
2) To do the above, we have to loop through the data first, to check whether any 
later rows contain the same PK as previous rows, so we build a MAP in memory 
to override the previous rows with the latest ones.
This is a very expensive operation. 
3) Even then, for every row after the above filter step, we still have to 
search the current index to see whether any data with the same PK exists. 
So we have to do the remove before we add the new data to the index.

I want to know if anyone has had the same PK requirement with Lucene. What is 
the best way to index data in this case?

First, I am wondering whether it is possible to remove step 2 above.
The problem with Lucene is that when we add a document to the index, we can 
NOT search it before we commit it.
But we only commit once, when the whole data file is finished. So we have to 
loop through the data once to check whether any rows in the data file share the 
same PK.
I am wondering if there is a way for the index writer, before it commits 
anything, to merge the PK data as new documents are added. What I mean is: if 
the same PK already exists in a previously added document, just remove it and 
let the newly added data with the same PK take its place. If we can do this, 
then the whole pre-checking step can be removed.

Second, for step 3 above, if searching the existing index is NOT avoidable, 
what is the fastest way to search by the PK? Of course we already indexed all 
the PK fields. When we add new data, we have to search the existing index by 
the PK fields for every row, to see whether it exists. If it does, we remove 
it and add the new one.
We construct the query from the PK fields at run time, then search row by row. 
This is also very bad for indexing performance.

Here is what I am thinking:
1) Can I use IndexReader.termDocs(term)? I heard it is much faster than query 
searching. Is that right?
2) Currently we do the search row by row. Should I do it in batches? For 
example, combine 100 PK searches into one search using a BooleanQuery, so one 
search gives me back all the data for those 100 PKs that are in the index. Then 
I can remove them from the index using the result set. In that case, I only 
need 1/100 of the search requests, which should be much faster than row by row, 
in theory.
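The batching idea in point 2 can also be applied on the delete side: instead of one search/delete per row, group the PK terms and issue one bulk delete per group. A minimal sketch of the grouping step (pure Java; the field name "pk" and batch size are assumptions, and the Lucene call is shown only in a comment):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Partition PK values into fixed-size batches; each batch would map to a
// single IndexWriter.deleteDocuments(Term[]) call in Lucene 2.9, instead of
// one search + delete per row.
public class PkBatchDelete {

    // Split the list of PK values into batches of at most batchSize entries.
    static List<String[]> toBatches(List<String> pks, int batchSize) {
        List<String[]> batches = new ArrayList<String[]>();
        for (int i = 0; i < pks.size(); i += batchSize) {
            List<String> slice = pks.subList(i, Math.min(i + batchSize, pks.size()));
            batches.add(slice.toArray(new String[0]));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> pks = Arrays.asList("k1", "k2", "k3", "k4", "k5");
        List<String[]> batches = toBatches(pks, 2);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).length); // 1
        // For each batch, build Term[] terms = { new Term("pk", value), ... }
        // and call writer.deleteDocuments(terms); -- one call per batch.
    }
}
```

With a batch size of 100 this reduces the number of writer calls by roughly the same factor as the search batching described above.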


Please let me know any feedback. If you have ever dealt with PK data support, 
please share some thoughts and experience.

Thanks for your kind help.
  
_
Hotmail: Free, trusted and rich email service.
http://clk.atdmt.com/GBL/go/171222984/direct/01/

RE: What is the best way to handle the primary key case during lucene indexing

2009-11-16 Thread java8964 java8964

What I mean is that for one index, a client can define multiple fields in the 
index as the primary key (a composite key).
> Date: Mon, 16 Nov 2009 12:45:40 -0500
> Subject: Re: What is the best way to handle the primary key case during 
> luceneindexing
> From: [email protected]
> To: [email protected]
> 
> What is the form of the unique key? I'm a bit confused here by your comment:
> "which can contain one or multi fields".
> 
> But it seems like IndexWriter.deleteDocuments should work here. It's easy
> if your PKs are single terms, there's even a deleteDocuments(Term[]) form.
> But this really *requires* that your PKs are single terms in a field. If
> your PKs
> are some sort of composite field, perhaps the iw.DeleteDocuments(Query[])
> would help where each query is enough to uniquely identify your document.
> 
> Best
> Erick
> 
> On Mon, Nov 16, 2009 at 12:15 PM, java8964 java8964 
> wrote:

RE: What is the best way to handle the primary key case during lucene indexing

2009-11-16 Thread java8964 java8964

But can IndexWriter.updateDocument(Term, Document) handle the composite key 
case?

If my primary key contains field1 and field2, can I use one Term to include 
both field1 and field2?

Thanks
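One common workaround (a sketch of my own, not something suggested in the thread): since updateDocument takes a single Term, concatenate the composite-key field values into one extra NOT_ANALYZED key field at index time, and update on that. The field name "pk" and the separator choice are assumptions; the separator must never occur inside field values.

```java
// Build a single surrogate key from the composite-key field values, so that
// IndexWriter.updateDocument(new Term("pk", key), doc) works with one Term.
public class CompositeKey {

    // Join the component values with a NUL separator (assumed safe here
    // because it is unlikely to appear in real field values).
    static String compositeKey(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) sb.append('\u0000');
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String key = compositeKey("field1Value", "field2Value");
        System.out.println(key.split("\u0000").length); // 2
        // At index time (Lucene 2.9), the surrogate field would be added as:
        //   doc.add(new Field("pk", key, Field.Store.NO, Field.Index.NOT_ANALYZED));
        //   writer.updateDocument(new Term("pk", key), doc);
    }
}
```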

> Date: Mon, 16 Nov 2009 09:44:35 -0800
> Subject: Re: What is the best way to handle the primary key case during 
> luceneindexing
> From: [email protected]
> To: [email protected]
> 
> The usual way to do this is to use:
> 
>IndexWriter.updateDocument(Term, Document)
> 
> This method deletes all documents with the given Term in it (this would be
> your primary key), and then adds the Document you want to add.  This is the
> traditional way to do updates, and it is fast.
> 
>   -jake
> 
> 
> 
> On Mon, Nov 16, 2009 at 9:15 AM, java8964 java8964 
> wrote:

During the wild card search, will lucene 2.9.0 to convert the search string to lower case?

2010-02-01 Thread java8964 java8964

I noticed a strange result from the following test case. For wildcard search, 
my understanding is that Lucene will NOT apply any analyzer to the query string. 
But as the following simple code shows, it looks like Lucene lower-cases the 
search query in the wildcard search. Why? If not, why does the following test 
case show one hit for the lower-case wildcard search, but none for the 
upper-case data? My original data is NOT analyzed, so it should be stored as 
the original data in the index segment, right?

Lucene version: 2.9.0

JDK version: JDK 1.6.0_17


public class IndexTest1 {
    public static void main(String[] args) {
        try {
            Directory directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory,
                    new StandardAnalyzer(Version.LUCENE_CURRENT),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("title", "BBB CCC", Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            doc = new Document();
            doc.add(new Field("title", "ddd eee", Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);

            writer.close();

            IndexSearcher searcher = new IndexSearcher(directory, true);
            PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
                    new StandardAnalyzer(Version.LUCENE_CURRENT));
            wrapper.addAnalyzer("title", new KeywordAnalyzer());
            Query query = new QueryParser("title", wrapper).parse("title:BBB*");
            System.out.println("hits of title = "
                    + searcher.search(query, 100).totalHits);
            query = new QueryParser("title", wrapper).parse("title:ddd*");
            System.out.println("hits of title = "
                    + searcher.search(query, 100).totalHits);
            searcher.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

The output:
hits of title = 0
hits of title = 1

  

RE: During the wild card search, will lucene 2.9.0 to convert the search string to lower case?

2010-02-01 Thread java8964 java8964

I would like to confirm your reply. You mean that the query parser does the 
lower-casing? In fact, it looks like it only does this for the wildcard query, 
right?

For the term query, it doesn't, as shown if you change the line to:

Query query = new QueryParser("title", wrapper).parse("title:\"BBB 
CCC\"");

You will get 1 hit back. So in this case, the query parser class behaves 
differently for term queries and wildcard queries.

We have to use the query parser in this case, but we have our own query parser 
class extending Lucene's query parser class. Is there anything we can do about 
it?

Will Lucene's query parser class be fixed for the above inconsistent 
implementation?

Thanks

Thanks


> From: [email protected]
> To: [email protected]
> Subject: RE: During the wild card search, will lucene 2.9.0 to convert the 
> search string to lower case?
> Date: Mon, 1 Feb 2010 17:41:08 +0100
> 
> Only query parser does the lower casing. For such a special case, I would 
> suggest to use a PrefixQuery or WildcardQuery directly and not use query 
> parser.
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
> 
> 
> 
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
  

RE: During the wild card search, will lucene 2.9.0 to convert the search string to lower case?

2010-02-01 Thread java8964 java8964

This may be what I am looking for. We are using the default value, which 
is true.

Let me examine this method more.

Thanks for your help.
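For reference, a minimal illustration of what setLowercaseExpandedTerms controls (a sketch of my own): when it is true (the default), QueryParser lower-cases the text of wildcard/prefix terms before matching, so "BBB*" becomes "bbb*" and no longer matches the NOT_ANALYZED value "BBB CCC".

```java
// Models the parser's handling of an expanded (wildcard/prefix) term:
// with lowercaseExpandedTerms on, the term text is lower-cased wholesale.
public class LowercaseExpanded {

    static String expandTerm(String termText, boolean lowercaseExpandedTerms) {
        return lowercaseExpandedTerms ? termText.toLowerCase() : termText;
    }

    public static void main(String[] args) {
        System.out.println(expandTerm("BBB*", true));  // bbb*
        System.out.println(expandTerm("BBB*", false)); // BBB*
        // In Lucene 2.9 the behavior is switched off with:
        //   parser.setLowercaseExpandedTerms(false);
    }
}
```

Note that turning it off makes ALL wildcard searches case-sensitive, which matters for the tokenized (lower-cased) fields too.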

> From: [email protected]
> To: [email protected]
> Subject: RE: During the wild card search, will lucene 2.9.0 to convert the 
> search string to lower case?
> Date: Mon, 1 Feb 2010 20:36:29 +0200
> 
> Did you try queryParser.SetLowercaseExpandedTerms(false)?
> 
> DIGY
> 

confused by the lucene boolean query with wildcard result

2010-02-02 Thread java8964 java8964

Hi, I have the following test case pointing at an index generated by our 
application. The result confuses me and I don't know the reason.

Lucene version: 2.9.0
JDK 1.6.0_18

public class IndexTest1 {
    public static void main(String[] args) {
        try {
            FSDirectory directory = FSDirectory.open(
                    new File("/path_to_index_files"));
            IndexSearcher searcher = new IndexSearcher(directory, true);
            PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
                    new StandardAnalyzer());
            wrapper.addAnalyzer("f1string_sif", new KeywordAnalyzer());
            wrapper.addAnalyzer("f2string_ti",
                    new StandardAnalyzer(Version.LUCENE_CURRENT));
            Query query = new QueryParser("f1string_sif",
                    new StandardAnalyzer(Version.LUCENE_CURRENT))
                    .parse("f2string_ti:subbank*");
            System.out.println("query = " + query);
            System.out.println("hits = " + searcher.search(query, 100).totalHits);
            searcher.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

Output:
query = f2string_ti:subbank*
hits = 6

If I change the line to the following:

Query query = new QueryParser("f1string_sif", new 
StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:rdmap*");

Output:
query = f2string_ti:rdmap*
hits = 4

The above result are both correct based on my data.

Now if I change the line to:

Query query = new QueryParser("f1string_sif", new 
StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:subbank* OR 
f2string_ti:rdmap*");

Output:
query = f2string_ti:subbank* f2string_ti:rdmap*
hits = 2


I assume the count of the last result should be at least max(6, 4), but it is 
2. Any reason for that?

Thanks
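To make the expectation explicit: two SHOULD clauses in a BooleanQuery match the union of the documents matched by each clause, so the combined count can never be below the larger clause. A small sketch of that arithmetic (plain Java sets standing in for result sets; the doc ids are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// OR semantics as set union: |A OR B| = |A union B| >= max(|A|, |B|).
public class UnionExpectation {

    static int unionSize(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<Integer>(a);
        union.addAll(b);
        return union.size();
    }

    public static void main(String[] args) {
        // Pretend the subbank* clause matched 6 docs and rdmap* matched 4.
        Set<Integer> subbankHits = new HashSet<Integer>(Arrays.asList(0, 1, 2, 3, 4, 5));
        Set<Integer> rdmapHits = new HashSet<Integer>(Arrays.asList(4, 5, 6, 7));
        System.out.println(unionSize(subbankHits, rdmapHits)); // 8, never below 6
    }
}
```

So a combined count of 2 indicates the parsed OR query is not matching the same term sets as the two single-clause queries did.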

  

RE: During the wild card search, will lucene 2.9.0 to convert the search string to lower case?

2010-02-02 Thread java8964 java8964

Is there an analyzer like the keyword analyzer that also lower-cases the data? 
Or do I have to write a custom analyzer myself?

Thanks
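As far as I know Lucene 2.9 does not ship such an analyzer, but one is trivial to write: a KeywordTokenizer wrapped in a LowerCaseFilter (the analyzer class below is my own sketch, shown in a comment since it needs the Lucene jar; the normalization it performs on a field value is just a whole-string lower-casing, demonstrated in plain Java):

```java
// Hypothetical analyzer for Lucene 2.9 (sketch, not compiled here):
//
//   public class LowercaseKeywordAnalyzer extends Analyzer {
//       public TokenStream tokenStream(String fieldName, Reader reader) {
//           return new LowerCaseFilter(new KeywordTokenizer(reader));
//       }
//   }
//
// Effect on a field value: one token, not split on whitespace, lower-cased.
public class KeywordLowercase {

    static String normalize(String fieldValue) {
        return fieldValue.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("BBB CCC")); // bbb ccc
    }
}
```

With the field indexed this way, the parser's lower-cased wildcard term "bbb*" would match again, while a stored copy of the field can still return the original upper-case text.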

> From: [email protected]
> To: [email protected]
> Subject: RE: During the wild card search, will lucene 2.9.0 to convert the 
> search string to lower case?
> Date: Mon, 1 Feb 2010 14:24:00 -0500
> 
> 
> This is maybe something I am looking for. We are using the default value, 
> which is true.
> 
> Let me examine this method more.
> 
> Thanks for your help.
> 
> > From: [email protected]
> > To: [email protected]
> > Subject: RE: During the wild card search, will lucene 2.9.0 to convert the 
> > search string to lower case?
> > Date: Mon, 1 Feb 2010 20:36:29 +0200
> > 
> > Did you try queryParser.SetLowercaseExpandedTerms(false)?
> > 
> > DIGY
> > 
> > -Original Message-
> > From: java8964 java8964 [mailto:[email protected]] 
> > Sent: Monday, February 01, 2010 8:11 PM
> > To: [email protected]
> > Subject: RE: During the wild card search, will lucene 2.9.0 to convert the
> > search string to lower case?
> > 
> > 
> > I would like to confirm your reply. You mean that the query parse will lower
> > casing. In fact, it looks like that it only does this for wild card query,
> > right?
> > 
> > For the term query, it didn't. As proved by if you change the line to:
> > 
> > Query query = new QueryParser("title",
> > wrapper).parse("title:\"BBB CCC\"");
> > 
> > You will get 1 hits back. So in this case, the query parser class did in
> > different way for term query and wild card query.
> > 
> > We have to use the query parse in this case, but we have our own Query
> > parser class extends from the lucene query parser class. Anything we can do
> > to about it?
> > 
> > Will lucense's query parser class be fixed for the above inconsistent
> > implementation?
> > 
> > Thanks
> > 
> > 
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: RE: During the wild card search, will lucene 2.9.0 to convert the
> > search string to lower case?
> > > Date: Mon, 1 Feb 2010 17:41:08 +0100
> > > 
> > > Only query parser does the lower casing. For such a special case, I would
> > suggest to use a PrefixQuery or WildcardQuery directly and not use query
> > parser.
> > > 
> > > -
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: [email protected]
> > > 
> > > > -Original Message-
> > > > From: java8964 java8964 [mailto:[email protected]]
> > > > Sent: Monday, February 01, 2010 5:27 PM
> > > > To: [email protected]
> > > > Subject: During the wild card search, will lucene 2.9.0 to convert the
> > > > search string to lower case?
> > > > 
> > > > 
> > > > I noticed a strange result from the following test case. For wildcard
> > > > search, my understanding is that lucene will NOT use any analyzer on
> > > > the query string. But as the following simple code to show, it looks
> > > > like that lucene will lower case the search query in the wildcard
> > > > search. Why? If not, why the following test case show the search hits
> > > > as one for lower case wildcard search, but not for the upper case data?
> > > > My original data is NOT analyzed, so they should be stored as the
> > > > original data in the index segment, right?
> > > > 
> > > > Lucene version: 2.9.0
> > > > 
> > > > JDK version: JDK 1.6.0_17
> > > > 
> > > > 
> > > > public class IndexTest1 {
> > > > public static void main(String[] args) {
> > > > try {
> > > > Directory directory = new RAMDirectory();
> > > > IndexWriter writer = new IndexWriter(directory, new
> > > > StandardAnalyzer(Version.LUCENE_CURRENT),
> > > > IndexWriter.MaxFieldLength.UNLIMITED);
> > > > Document doc = new Document();
> > > > doc.add(new Field("title", "BBB CCC", Field.Store.YES,
> > > > Field.Index.NOT_ANALYZED));
> > > > writer.addDocument(doc);
> > > > doc = new Document();
> > > > doc.add(new Field("

RE: During the wild card search, will lucene 2.9.0 to convert the search string to lower case?

2010-02-03 Thread java8964 java8964

Thanks for your help.

My concern now is that the field could be defined as stored. When the user 
retrieves the field data, we still want to show the original data, which in 
this case is upper case.

First, I don't think I can use queryParser.setLowercaseExpandedTerms(false), 
because that would remove the case-insensitive wildcard search functionality 
for tokenized fields.

To handle this case, if the data is NOT tokenized but contains upper-case 
characters, then to be able to do a wildcard search with an upper-case 
pattern like 'BB*', I am thinking that I have to analyze the non-tokenized 
data using a KeywordTokenizer plus a lower-case filter.

For your suggestion, will the data be changed to lower case in what is 
stored in Lucene when it is retrieved?

Thanks
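[Editor's note] The approach described above can be sketched without Lucene: a KeywordTokenizer emits the entire field value as a single token, and a lower-case filter then lowercases that token. The helper name `analyzeKeywordLowercase` below is hypothetical, a plain-Java model of what that chain would emit for indexing, not Lucene code:

```java
import java.util.Locale;

public class KeywordLowercaseSketch {
    // KeywordTokenizer emits the whole input as one token;
    // a lower-case filter then lowercases it. The net effect on the
    // indexed token stream is a single lower-cased token.
    public static String analyzeKeywordLowercase(String fieldValue) {
        return fieldValue.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexedToken = analyzeKeywordLowercase("BBB CCC");
        // The query parser lowercases a wildcard pattern 'BB*' to 'bb*',
        // which now matches the indexed token.
        System.out.println(indexedToken);                  // bbb ccc
        System.out.println(indexedToken.startsWith("bb")); // true
    }
}
```

This only models the indexed tokens. If the field is also stored (a separate Field instance with Field.Store.YES), the stored copy keeps the original casing, so the display concern above should be unaffected.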

> From: [email protected]
> To: [email protected]
> Subject: RE: During the wild card search, will lucene 2.9.0 to convert the
> search string to lower case?
> Date: Wed, 3 Feb 2010 11:17:27 +0100
> 
> For specific fields that need a special TokenStream chain, there is no need 
> to write a separate analyzer. You can add fields to a document using a 
> TokenStream as parameter: new Field(name, TokenStream).
> 
> For the TokenStream, just create a chain from a Tokenizer and any Filters, 
> like:
> 
> TokenStream ts = new KeywordTokenizer(new StringReader("your text to index"));
> ts = new LowerCaseFilter(ts);
> ...
> document.add(new Field("fieldname", ts));
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
> 
> 
> > -Original Message-
> > From: Ian Lea [mailto:[email protected]]
> > Sent: Wednesday, February 03, 2010 11:06 AM
> > To: [email protected]
> > Subject: Re: During the wild card search, will lucene 2.9.0 to convert
> > the search string to lower case?
> > 
> > I think you'll have to write your own.  Or just downcase the text
> > yourself first.
> > 
> > 
> > --
> > Ian.
> > 
> > 
> > On Tue, Feb 2, 2010 at 9:30 PM, java8964 java8964
> >  wrote:
> > >
> > > Is there an analyzer like the keyword analyzer that will also
> > > lower-case the data for Lucene? Or do I have to write a custom analyzer
> > > myself?
> > >
> > > Thanks
> > >
> > >> From: [email protected]
> > >> To: [email protected]
> > >> Subject: RE: During the wild card search, will lucene 2.9.0 to
> > convert the search string to lower case?
> > >> Date: Mon, 1 Feb 2010 14:24:00 -0500
> > >>
> > >>
> > >> This may be what I am looking for. We are using the default value,
> > >> which is true.
> > >>
> > >> Let me examine this method more.
> > >>
> > >> Thanks for your help.
> > >>
> > >> > From: [email protected]
> > >> > To: [email protected]
> > >> > Subject: RE: During the wild card search, will lucene 2.9.0 to
> > convert the search string to lower case?
> > >> > Date: Mon, 1 Feb 2010 20:36:29 +0200
> > >> >
> > >> > Did you try queryParser.SetLowercaseExpandedTerms(false)?
> > >> >
> > >> > DIGY
> > >> >
> > >> > -Original Message-
> > >> > From: java8964 java8964 [mailto:[email protected]]
> > >> > Sent: Monday, February 01, 2010 8:11 PM
> > >> > To: [email protected]
> > >> > Subject: RE: During the wild card search, will lucene 2.9.0 to
> > convert the
> > >> > search string to lower case?
> > >> >
> > >> >
> > >> > I would like to confirm your reply. You mean that the query parser
> > >> > will do the lower casing. In fact, it looks like it only does this
> > >> > for a wildcard query, right?
> > >> >
> > >> > For the term query, it didn't. This is shown if you change the line
> > >> > to:
> > >> >
> > >> > Query query = new QueryParser("title",
> > >> > wrapper).parse("title:\"BBB CCC\"");
> > >> >
> > >> > You will get 1 hit back. So in this case, the query parser class
> > >> > behaved differently for the term query and the wildcard query.
> > >> >
> > >> > We have to use the query parser in this case, but we have our own
> > >> > query parser class that extends the Lucene query parser class.
> > >> > Anything we can do
> &

RE: confused by the lucene boolean query with wildcard result

2010-02-03 Thread java8964 java8964

Thanks for your help.

I upgraded Lucene to 2.9.1 and the problem is gone. It looks like a boolean 
query bug in Lucene 2.9.0 that was fixed in 2.9.1.

Thanks
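[Editor's note] The expectation in the quoted thread below, that an OR of two wildcard clauses should match at least as many documents as either clause alone, can be illustrated with hypothetical doc-id sets in plain Java (the ids are made up to match the reported hit counts of 6 and 4):

```java
import java.util.HashSet;
import java.util.Set;

public class BooleanOrSketch {
    public static void main(String[] args) {
        // Hypothetical doc-id sets for the two prefix queries.
        Set<Integer> subbankHits = new HashSet<>(Set.of(1, 2, 3, 4, 5, 6)); // 6 hits
        Set<Integer> rdmapHits   = new HashSet<>(Set.of(5, 6, 7, 8));       // 4 hits

        // An OR (SHOULD + SHOULD) BooleanQuery matches the union of the clauses.
        Set<Integer> union = new HashSet<>(subbankHits);
        union.addAll(rdmapHits);

        // |A ∪ B| >= max(|A|, |B|), so a total of 2 hits indicates a bug.
        System.out.println(union.size()); // 8
        System.out.println(union.size() >= Math.max(subbankHits.size(), rdmapHits.size())); // true
    }
}
```

Whatever the actual doc ids are, the union can never be smaller than the larger clause, which is why the observed count of 2 pointed to a BooleanQuery bug rather than a data problem.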

> From: [email protected]
> Date: Wed, 3 Feb 2010 10:02:27 +
> Subject: Re: confused by the lucene boolean query with wildcard result
> To: [email protected]
> 
> You should probably be using your PerFieldAnalyzerWrapper in your
> calls to QueryParser but apart from that I can't see any obvious
> reason.  General advice: use Luke to check what has been indexed and
> read 
> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F
> 
> If none of these help, post again but showing what you are indexing as
> well as how you are searching - the smallest possible test case or
> self-contained program that shows the problem.
> 
> Or maybe someone else will spot the problem.
> 
> 
> --
> Ian.
> 
> 
> 
> On Tue, Feb 2, 2010 at 8:56 PM, java8964 java8964  
> wrote:
> >
> > Hi, I have the following test case pointing to the index generated in our 
> > application. The result is confusing me and I don't know the reason.
> >
> > Lucene version: 2.9.0
> > JDK 1.6.0_18
> >
> > public class IndexTest1 {
> >public static void main(String[] args) {
> >try {
> >FSDirectory directory = FSDirectory.open(new 
> > File("/path_to_index_files"));
> >IndexSearcher searcher = new IndexSearcher(directory, true);
> >PerFieldAnalyzerWrapper wrapper = new 
> > PerFieldAnalyzerWrapper(new StandardAnalyzer());
> >wrapper.addAnalyzer("f1string_sif", new KeywordAnalyzer());
> >wrapper.addAnalyzer("f2string_ti", new 
> > StandardAnalyzer(Version.LUCENE_CURRENT));
> >Query query = new QueryParser("f1string_sif", new 
> > StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:subbank*");
> >System.out.println("query = " + query);
> >System.out.println("hits = " + searcher.search(query, 
> > 100).totalHits);
> >searcher.close();
> >} catch (Exception e) {
> >System.out.println(e);
> >}
> >}
> > }
> >
> > Output:
> > query = f2string_ti:subbank*
> > hits = 6
> >
> > If I change the line to the following:
> >
> > Query query = new QueryParser("f1string_sif", new 
> > StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:rdmap*");
> >
> > Output:
> > query = f2string_ti:rdmap*
> > hits = 4
> >
> > The above result are both correct based on my data.
> >
> > Now if I change the line to:
> >
> > Query query = new QueryParser("f1string_sif", new 
> > StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:subbank* OR 
> > f2string_ti:rdmap*");
> >
> > Output:
> > query = f2string_ti:subbank* f2string_ti:rdmap*
> > hits = 2
> >
> >
> > I assume the count in the last result should be at least max(6,4), but it 
> > is 2. Any reason for that?
> >
> > Thanks
> >
> >
> 