Re: Indexing Query

2015-02-18 Thread Ian Lea
You mean you'd like a BooleanQuery.setMaximumNumberShouldMatch()
method?  Unfortunately that doesn't exist and I can't think of a
simple way of doing it.


--
Ian.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing Query

2015-02-18 Thread Deepak Gopalakrishnan
Oops, alright, I'll probably look around for a workaround.


-- 
Regards,
*Deepak Gopalakrishnan*
*Mobile*:+918891509774
*Skype* : deepakgk87
http://myexps.blogspot.com


Re: Indexing Query

2015-02-18 Thread Jack Krupansky
You could store the length of the field (in terms) in a second field and
then add a MUST clause to the BooleanQuery: a range query on that second
field whose upper bound is the maximum length that can match.
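
A minimal plain-Java model of this idea, for illustration only. It is not
Lucene code: the Entry class, its termCount member and the search() helper are
made-up names, and whitespace splitting stands in for real analysis. The point
is only that the term count is computed once at index time and then checked as
a required upper bound next to the term match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical model of "length in a second field"; not Lucene API.
final class LengthFieldSketch {

    // One index entry: its terms plus a second "field" holding the term count.
    static final class Entry {
        final List<String> terms;
        final int termCount;

        Entry(String text) {
            // Whitespace split is an assumption standing in for an analyzer.
            this.terms = Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
            this.termCount = terms.size();
        }
    }

    // Entries that contain the term AND have at most maxTerms terms.
    static List<Entry> search(List<Entry> index, String term, int maxTerms) {
        List<Entry> hits = new ArrayList<>();
        for (Entry e : index) {
            boolean lengthOk = e.termCount <= maxTerms; // the required range check
            boolean termOk = e.terms.contains(term);    // the term match
            if (lengthOk && termOk) {
                hits.add(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Entry> index = Arrays.asList(
                new Entry("red"), new Entry("red bicycle"), new Entry("big red bicycle"));
        // A unigram query limited to entries of at most 2 tokens.
        System.out.println(search(index, "red", 2).size()); // 2
    }
}
```

In an actual index the count would be written as an extra field when the
document is added, so the length check costs no more than any other clause.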

-- Jack Krupansky



Re: Indexing Query

2015-02-17 Thread Deepak Gopalakrishnan
Thanks Ian. Also, if I have a unigram in the query, and I want to make sure
I match only index entries that do not have more than 2 tokens, is there a
way to do that too?

Thanks


-- 
Regards,
*Deepak Gopalakrishnan*
*Mobile*:+918891509774
*Skype* : deepakgk87
http://myexps.blogspot.com


Indexing Query

2015-02-17 Thread Deepak Gopalakrishnan
Hello,

I have a rather simple query. I have a list where I have terms like and
then my query is more natural language. I want to be able to retrieve
matches that have at least 2 words in common between the query and the index.

Can you guys suggest a Query Type and a field that I should be using?

-- 
Regards,
*Deepak Gopalakrishnan*


Re: Indexing Query

2015-02-17 Thread Ian Lea
Break the query into words then add them as TermQuery instances as
optional clauses to a BooleanQuery with a call to
setMinimumNumberShouldMatch(2) somewhere along the line.  You may want
to do some parsing or analysis on the query terms to avoid problems of
case matching and the like.
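
A plain-Java model of the rule described above, for illustration only. It is
not Lucene code: the class and method names are made up, and the regex split
is a crude stand-in for a real analyzer. In Lucene, each term would instead
become a TermQuery added as an optional clause, with
setMinimumNumberShouldMatch(2) called on the BooleanQuery.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of "at least N optional terms must match"; not Lucene API.
final class MinShouldMatchSketch {

    // Crude analysis step: lower-case, then split on anything non-alphanumeric.
    static Set<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Number of distinct query terms that also occur in the index entry.
    static int sharedTerms(String query, String entry) {
        Set<String> shared = terms(query);
        shared.retainAll(terms(entry));
        return shared.size();
    }

    // True when at least minShouldMatch of the optional terms match.
    static boolean matches(String query, String entry, int minShouldMatch) {
        return sharedTerms(query, entry) >= minShouldMatch;
    }

    public static void main(String[] args) {
        String query = "Where can I buy a Red Bicycle?";
        System.out.println(matches(query, "red bicycle", 2)); // true
        System.out.println(matches(query, "red apple", 2));   // false
    }
}
```

The lower-casing in terms() is the kind of analysis that avoids the case
matching problems mentioned above: "Red" in the query still matches "red" in
the entry.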


--
Ian.




FW: Binary indexing / query efficiency

2009-04-16 Thread Eger, Patrick
Resending, I think this got dropped by the list for some reason



Binary indexing / query efficiency

2009-04-14 Thread Eger, Patrick
Hi, was recently looking to incorporate Lucene for a simple
parametric/faceted type search.  The documents are very small,
roughly 15 fields of short length (5-15 characters, generally strings
and padded integers). I profiled the query performance of our
application, which inserts 1 million documents and then:
 1) filters on 1-3 fields with simple boolean/term matches
 2) stores these docids in a BitSet
 3) calls IndexSearcher.doc() to retrieve all matching documents (all
fields, 100 - 1,000,000 results per call)

It turns out that 98% of the query time was spent not actually doing the
query, but within the IndexSearcher.doc() call.

My first question is, is there any way to more efficiently get
(all/most) of the fields for a set of documents, other than iterating
and calling doc()?

Additionally, is there any way (or planned feature) to index *binary*
data? Using a profiler, I have determined that String decoding is a
significant performance limiter for my use-case:

90% of the application time is spent in this method:
---
org.apache.lucene.index.FieldsReader.addField(Document, FieldInfo,
boolean, boolean, boolean)


46% of the application time is spent decoding strings (half of the above
addField() time):
---
org.apache.lucene.store.IndexInput.readString()
java.lang.String.<init>(byte[], int, int, String)
java.lang.StringCoding.decode(String, byte[], int, int)
java.lang.StringCoding$StringDecoder.decode(byte[], int, int)

(YJP profiler output available if needed)

String.intern() was my top hot spot, but my patch was accepted and fixed
this: https://issues.apache.org/jira/browse/LUCENE-1600. I'm not
familiar enough with the lucene codebase to figure out the above though,
so thought I would ask.



// Ideally I'd be able to add a binary field like this (wished-for API):
doc.add(new Field(f1, new byte[]{1,2,3,4},
        Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

// ...and then query like this (also wished-for):
Query q = new TermQuery(new Term(f1, new byte[]{1,2,3,4}));
searcher.search(q,...);

Which would allow me to avoid the Integer -> String -> Padded String ->
String -> Integer coding/decoding to index an integer, and avoid Object
-> String -> Object conversion (which per above is quite expensive).


Thanks for any help!


Regards, 

Patrick




Re: Binary indexing / query efficiency

2009-04-14 Thread Khawaja Shams
Hi,
  It is not a good idea to extract each document. You can be more efficient
by only looking at the fields you are interested in. Depending on the size
of your index, you can try:

String[] codes = FieldCache.DEFAULT.getStrings(indexReader, fieldName);


This returns a String[] whose length is the number of documents in
your index. If you are doing faceted searching, you may want to try:

StringIndex stringIndex = FieldCache.DEFAULT.getStringIndex(indexReader,
fieldName);

The StringIndex class has a lookup array and an order array. The order array
contains a value for each document id, and you can use this value to extract
the string from the lookup array once you are done counting.
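
A rough plain-Java sketch of the counting step, for illustration only. It does
not call Lucene: the lookup and order arrays are hand-built here in the shape
described above, and the class name, method name and sample values are made up.
The matching doc ids come from a BitSet, as in the original question.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical facet counter over lookup/order arrays; not Lucene API.
final class FacetCountSketch {

    // lookup: the distinct field values.
    // order[docId]: position in lookup of that document's value.
    static Map<String, Integer> count(String[] lookup, int[] order, BitSet matchingDocs) {
        int[] counts = new int[lookup.length];
        // Count by ordinal only; no strings are touched inside this loop.
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0;
                 doc = matchingDocs.nextSetBit(doc + 1)) {
            counts[order[doc]]++;
        }
        // Resolve ordinals to strings once, after counting is done.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < lookup.length; i++) {
            if (counts[i] > 0) {
                result.put(lookup[i], counts[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lookup = {"blue", "green", "red"}; // made-up facet values
        int[] order = {2, 0, 2, 1, 2};              // five documents
        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(2);
        hits.set(3);
        System.out.println(count(lookup, order, hits)); // {green=1, red=2}
    }
}
```

Because the loop works on int ordinals, the per-document cost is an array
read and an increment, which avoids the string decoding seen in the profile.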


Perhaps the Lucene experts can shed light on a better approach.

You may also want to look at Solr for faceted search support :). HTH.


Regards,
Khawaja Shams



Re: Binary indexing / query efficiency

2009-04-14 Thread eks dev

You can already store binary values, e.g. with:
Field(String name, byte[] value, Field.Store store)

You could store all your fields as byte[], so you get them back as byte[]. How
you index them is a separate problem, but since you are having no problems with
speed there, leave it as it is.

Simply try creating a pair of fields for each field you have now: one stored
and not indexed, the other indexed and not stored. Or keep the fields you use
for searching as indexed only, and add one big byte[] field in which you encode
the whole document (a blob). If that gets complex, you could try protobuf, thrift...
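
A small plain-Java sketch of the blob idea, under stated assumptions: the three
fields (year, price, model) and the fixed layout are invented for illustration,
and the class is not part of Lucene. The resulting byte[] is what would go into
a single stored binary field; protobuf or thrift would replace this hand-rolled
layout once the documents get more complex.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-rolled blob layout: [year:int][price:int][len:int][model:UTF-8].
final class BlobFieldSketch {

    // Pack all of a document's values into one byte[].
    static byte[] encode(int year, int price, String model) {
        byte[] name = model.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + 4 + name.length);
        buf.putInt(year).putInt(price).putInt(name.length).put(name);
        return buf.array();
    }

    // Integers are read straight from fixed offsets; no String is created.
    static int year(byte[] blob) {
        return ByteBuffer.wrap(blob).getInt(0);
    }

    static int price(byte[] blob) {
        return ByteBuffer.wrap(blob).getInt(4);
    }

    // The string is decoded only when the caller actually asks for it.
    static String model(byte[] blob) {
        int len = ByteBuffer.wrap(blob).getInt(8);
        return new String(blob, 12, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] blob = encode(2009, 15000, "wagon");
        System.out.println(year(blob) + " " + price(blob) + " " + model(blob));
    }
}
```

With a layout like this, the integer fields never go through the padded
string round trip, and string decoding happens per field on demand rather than
for every field of every retrieved document.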

Anyhow, your idea of byte[] as an indexed, searchable unit is maybe not all
that bad, but it does not look like you need it, and it would not be an easy
thing to change (I guess).

